Recommendations with IBM

In this notebook, a project that is part of the Udacity Data Science Nanodegree, we build out a number of different methods for making recommendations, each suited to a different situation.

Table of Contents

I. Exploratory Data Analysis
II. Rank Based Recommendations
III. User-User Based Collaborative Filtering
IV. Matrix Factorization

Let's get started by importing the necessary libraries and reading in the data.

In [2]:
# 
# import libraries
#
import pandas as pd
import numpy as np
import project_tests as t
import pickle
from subprocess import call

import seaborn as sns
sns.set_style('darkgrid')
import matplotlib.pyplot as plt
%matplotlib inline
In [6]:
# check python and pandas version
import sys
print("Python: {}".format(sys.version))
print("Pandas: {}".format(pd.__version__))
Python: 3.7.3 (default, Apr 24 2019, 15:29:51) [MSC v.1915 64 bit (AMD64)]
Pandas: 1.0.1
In [2]:
# load data
try:
    df = pd.read_csv('data/user-item-interactions.csv')
    df_content = pd.read_csv('data/articles_community.csv')
    del df['Unnamed: 0']
    del df_content['Unnamed: 0']
except FileNotFoundError:
    file1 = 'user-item-interactions.csv'
    file2 = 'articles_community.csv'
    print("The csv files {} and {} don't exist in the given directory. No analysis possible.".format(file1, file2))
    # re-raise so the cells below don't fail with an undefined df
    raise
    
# success
print("The user-item dataset has {} data points with {} variables each.".format(*df.shape))
print("The articles-community dataset has {} data points with {} variables each.".format(*df_content.shape))
The user-item dataset has 45993 data points with 3 variables each.
The articles-community dataset has 1056 data points with 5 variables each.
In [3]:
# Show df to get an idea of the data
df.head()
Out[3]:
article_id title email
0 1430.0 using pixiedust for fast, flexible, and easier... ef5f11f77ba020cd36e1105a00ab868bbdbf7fe7
1 1314.0 healthcare python streaming application demo 083cbdfa93c8444beaa4c5f5e0f5f9198e4f9e0b
2 1429.0 use deep learning for image classification b96a4f2e92d8572034b1e9b28f9ac673765cd074
3 1338.0 ml optimization using cognitive assistant 06485706b34a5c9bf2a0ecdac41daf7e7654ceb7
4 1276.0 deploy your python model as a restful api f01220c46fc92c6e6b161b1849de11faacd7ccb2
In [4]:
# Show df_content to get an idea of the data
df_content.head()
Out[4]:
doc_body doc_description doc_full_name doc_status article_id
0 Skip navigation Sign in SearchLoading...\r\n\r... Detect bad readings in real time using Python ... Detect Malfunctioning IoT Sensors with Streami... Live 0
1 No Free Hunch Navigation * kaggle.com\r\n\r\n ... See the forest, see the trees. Here lies the c... Communicating data science: A guide to present... Live 1
2 ☰ * Login\r\n * Sign Up\r\n\r\n * Learning Pat... Here’s this week’s news in Data Science and Bi... This Week in Data Science (April 18, 2017) Live 2
3 DATALAYER: HIGH THROUGHPUT, LOW LATENCY AT SCA... Learn how distributed DBs solve the problem of... DataLayer Conference: Boost the performance of... Live 3
4 Skip navigation Sign in SearchLoading...\r\n\r... This video demonstrates the power of IBM DataS... Analyze NY Restaurant data using Spark in DSX Live 4

Part I : Exploratory Data Analysis

Now, we provide some insight into the descriptive statistics of the data.

1. What is the distribution of how many articles a user interacts with in the dataset? Provide a visual and descriptive statistics to assist with giving a look at the number of times each user interacts with an article.

In [5]:
df_content.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1056 entries, 0 to 1055
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   doc_body         1042 non-null   object
 1   doc_description  1053 non-null   object
 2   doc_full_name    1056 non-null   object
 3   doc_status       1056 non-null   object
 4   article_id       1056 non-null   int64 
dtypes: int64(1), object(4)
memory usage: 41.4+ KB
In [6]:
sum(df_content.duplicated())
Out[6]:
0
In [7]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45993 entries, 0 to 45992
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   article_id  45993 non-null  float64
 1   title       45993 non-null  object 
 2   email       45976 non-null  object 
dtypes: float64(1), object(2)
memory usage: 1.1+ MB
In [8]:
# see: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.astype.html
# change df article_id to int64, so, both dataframes are using the same datatype
df = df.astype({'article_id': 'int64'})
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45993 entries, 0 to 45992
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   article_id  45993 non-null  int64 
 1   title       45993 non-null  object
 2   email       45976 non-null  object
dtypes: int64(1), object(2)
memory usage: 1.1+ MB

Note: As visible in this df information, there are 17 fewer non-null email values than article_id's and titles. So, 17 rows have a null email value.

In [9]:
sum(df.duplicated())
Out[9]:
12311
In [10]:
interaction_dup_idx = df.index[df.duplicated()].tolist()
In [11]:
# Don't remove the duplicates now: dropping them would break task 3 below and the sol_1_dict test function!

# In general, though, the duplicates of the interaction dataset would be removed like this:
#df.drop_duplicates(inplace=True)
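To make the duplicate handling above concrete, here is a quick sketch with a small made-up interaction frame (the ids and emails are invented for illustration), showing how `duplicated()` and `drop_duplicates()` treat repeated rows:

```python
import pandas as pd

# Toy interaction frame (made-up ids/emails) to illustrate duplicate handling.
toy = pd.DataFrame({
    'article_id': [1, 1, 2, 2, 2],
    'email': ['a', 'a', 'b', 'b', 'c'],
})

# duplicated() marks every repeat of a full row except its first occurrence.
dup_mask = toy.duplicated()
print(dup_mask.sum())            # 2 duplicate rows (index 1 and 3)

# drop_duplicates() keeps the first occurrence of each row.
deduped = toy.drop_duplicates()
print(len(deduped))              # 3 unique rows remain
```

With `keep=False`, `duplicated()` would instead mark all copies of a repeated row, which is handy for inspecting duplicates before dropping them.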
In [12]:
#df.info()

# result: after deduplication there are 13 fewer non-null emails than rows, i.e. 13 interactions have no email.
#<class 'pandas.core.frame.DataFrame'>
#Int64Index: 33682 entries, 0 to 45992
#Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
#---  ------      --------------  ----- 
# 0   article_id  33682 non-null  int64 
# 1   title       33682 non-null  object
# 2   email       33669 non-null  object
#dtypes: int64(1), object(2)
#memory usage: 1.0+ MB
In [13]:
print("There are {} different articles in the interaction dataset.".format(len(df['article_id'].unique())))
print("In the interaction dataset the smallest article number is: {}".format(min(df['article_id'].unique())))
print("In the interaction dataset the highest article number is: {}".format(max(df['article_id'].unique())))
There are 714 different articles in the interaction dataset.
In the interaction dataset the smallest article number is: 0
In the interaction dataset the highest article number is: 1444
In [14]:
print("There are {} different articles in the articles dataset.".format(len(df_content['article_id'].unique())))
print("In the article dataset the smallest article number is: {}".format(min(df_content['article_id'].unique())))
print("In the article dataset the highest article number is: {}".format(max(df_content['article_id'].unique())))
There are 1051 different articles in the articles dataset.
In [15]:
series_grouped1 = df.groupby(['article_id'])['email'].count()
series_grouped1[:20]
Out[15]:
article_id
0      14
2      58
4      13
8      85
9      10
12    157
14     89
15     26
16     61
18     78
20    248
25     15
26     89
28     42
29     75
30     17
32     64
33    141
34     93
36     18
Name: email, dtype: int64
In [16]:
series_grouped1.values
Out[16]:
array([ 14,  58,  13,  85,  10, 157,  89,  26,  61,  78, 248,  15,  89,
        42,  75,  17,  64, 141,  93,  18,  68,  70, 460,  11,  89, 124,
       115,  20, 140,  11,  13,  16,  20,  55,  29,  57,  69,  57,  11,
        57,  19,  24,  12,  18,  24,  38,   7, 152,   9,  89,  29,  78,
        28, 189, 198,  64,  68,  20,  28,  16, 160,  10, 130, 110, 325,
       127,  54,  75,  44,  10,  37,  75,  42,  26,  22,  33, 352,  59,
        85,  13,  38, 149, 222,  91,  12,  16,   9,  40,   7,  78,  33,
       124, 115,  21,  89,  63,  53, 113,  21, 113,  85,  41,  31,  33,
        68,  16,  20,  41, 198,  33,  48, 169,  37, 155,  42,  23,  46,
        59,  10,  15, 222,  44,  20, 146,  16,  24,  14,  60,  45,  24,
        38,  18,  41,  28,  18,  23,  34,  10,  84,  34,  15,  51,  60,
        90,  15,  18,  18,  65,   9,  85,  34,  19,  25,  22,  23,  85,
        43,   6,  19,  67,   5,  13,  33,  80,   3,   2,  11,  24,  93,
        11,   9,   2,  73,  29,  44,  10,   2,   4, 270,   3,  21,  12,
        10,   7, 134,  35,  13, 151,   2,   1,   4,  47,  15,  32,  10,
         4,   2,  10,  40,   2,  54,   8,  49,  23,  20,  98,  11,  56,
        48,  22,  60,   9,   5,  19,  91,  17,  16, 160,  50,  77,  48,
        12,  48,  68,   2,  12,   6,   2,  60, 113,   8,   8,   4,  17,
        22,  85,  32,  10,   8, 179,  14,  54,   4,   4,  14,  56,  30,
         5,  42,   8,  18,   2,  23,  26,   2,   6, 128,  21, 209,  32,
        46,  43,   4,  21,  64,  20,  21,  10,   3,  18,  38,   2, 124,
         2, 102,  54, 108,   1,   9,  19,  83,  22,   1,  39,  28,  12,
        84,  16, 107,   1,   2,  13,   6,  18, 220,  73, 107,  18,  67,
        36,  58,   4,  32,   2,   6,  12,  18,  31,  21,  95,   1,  41,
        32, 121, 143, 239,  25,   4,  46,  52,  10,   2,   2,  12,   7,
        56,   6,  46,  64,  28,   2,  10,  29,  25,  18,  68,  37, 111,
         4,   5, 108, 151,  40,  84,  41, 101,  44,  99,  34,  15,  35,
         8,  29,  27,  12,  36, 123,  32,   2,  38,  19,  41, 104,  11,
        18,   4,  45,  34,  23,  10,  63,  54,  24,   7, 125,  52,  10,
        10,  26,  44,  11, 105,  14,  30, 135,   2,   6, 126,   2,  10,
        11,  28,  25,  35, 100,  83,  65,   2,  51,  27,  45,  26, 139,
         2,   9,   1,  74,  36, 130,   1,   4,  13,  26,  19,  62,   4,
        16,   7,  14,  17, 116,   5,  80, 234,  66,  74, 182,  51,   8,
        18,   8,  45,   4,  52, 137, 108,  87,  42, 330, 215, 219, 103,
        58,  79,  22,  66,  57,  22,   8,   6,   3,   9,   5,   2,   4,
         2,   6,   4,   1,   5,   9,   2,   9,   3,   8,   5,   2,   4,
         5,   2,   2,   2,   1,   8,   2,   3,   3,   2,   1,   8,   4,
         1,   8,   2,   2,   6,   9,   8,   1,   3,   2,   2,   2,   2,
        26,   6,   6,  11,   9,  12,   2,  12,   8,   4,   8,  14,  23,
         8,  16,  10,  14,  26,   2,  42,  55,  37, 433,  24, 512, 290,
       253, 372, 192,   2,  28,  57, 565, 213, 363,  32, 212,  55, 171,
        21,  38,   6,  78, 116, 168,  73, 442, 145,   6,   2,  32,  16,
        10,   8,   2,   2,   2,  23,   5,   1,   1,   2,   3,   2,   2,
        23,   8,  30,   2,   2,   5,   2,  12,   1,   4,   2,   1,   2,
         4,   2,   5,   7,   6,   6,   8,  12,  32,   1,   4, 473,  24,
       204, 347,  74, 104,  30,  25,  18, 191,   2,   3,  30,  13,  39,
         7,  15, 572,   4,  10, 193,   4,  61,  40,   2, 483, 413,   8,
         4,   4,  47, 614,   4,  95,   6,   6, 160,  58, 183,   2, 148,
        40, 927,  29, 206,   8,   2,   2, 379,  37, 382,   2, 293,   1,
         2,   7,  24,  43, 457, 426,   3,  41, 123,  19,  13, 214,  12,
        16,  22, 627,   2,  69, 185, 418,  10,   4,   2,  11,   2, 189,
        26,   2, 191, 454, 182, 122, 206,  24, 465, 279,   7,  52,  69,
        50, 136,  58,  22,  16,  54, 109,  42,  25,   4,  11, 102,  43,
         6, 113,   3, 163, 155, 131,  71, 138, 643, 120, 937, 336, 671,
       340, 108,  42, 120, 481, 218,  59,  10,   8,   4,  22,   5],
      dtype=int64)
In [17]:
max(series_grouped1.values)
Out[17]:
937
In [18]:
series_grouped1.idxmax()
Out[18]:
1429
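The `groupby(...)['email'].count()` pattern above can also be compared with `value_counts()`; the two differ when emails are missing, which matters here since df contains null emails. A small sketch with made-up data:

```python
import pandas as pd

df_toy = pd.DataFrame({
    'article_id': [10, 10, 10, 20, 20],
    'email': ['a', 'b', 'c', 'a', None],
})

# groupby(...)['email'].count() counts non-null emails per article ...
by_group = df_toy.groupby('article_id')['email'].count()

# ... while value_counts() counts rows per article, nulls included.
by_value = df_toy['article_id'].value_counts()

print(by_group[20], by_value[20])   # 1 vs 2: the None email is skipped by count()
print(by_group.idxmax())            # article 10 has the most non-null interactions
```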
In [19]:
# visualise the article - user email count distribution
index_list = list(range(0, len(series_grouped1)))
interaction_dict = {
    'article_id': list(series_grouped1.index),
    'email_count': list(series_grouped1.values),
}
df_grouped1_interact = pd.DataFrame(data=interaction_dict,
                                   columns=['article_id', 'email_count'], index=index_list)
df_grouped1_interact = df_grouped1_interact.nlargest(len(series_grouped1), 'email_count')
df_grouped1_interact.head(10)
Out[19]:
article_id email_count
699 1429 937
625 1330 927
701 1431 671
697 1427 643
652 1364 627
614 1314 614
600 1293 572
526 1170 565
518 1162 512
608 1304 483
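The dict-and-index construction above works, but the same ranked frame can be built more compactly straight from the grouped Series. A minimal sketch with a few made-up counts:

```python
import pandas as pd

# A grouped Series like series_grouped1 (values invented for illustration).
counts = pd.Series([14, 58, 937],
                   index=pd.Index([0, 2, 1429], name='article_id'),
                   name='email_count')

# reset_index() turns the Series straight into a two-column frame,
# and sort_values() replaces the nlargest(len(...)) full sort.
ranked = counts.sort_values(ascending=False).reset_index()
print(ranked.iloc[0].tolist())   # [1429, 937], the most-read article first
```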
In [20]:
# Figure dimension (width, height) in inches
df_grouped1_interact.plot.barh(x='article_id', y='email_count', figsize=[10,120])
plt.title("Article User Interaction Distribution")
plt.ylabel('article id')
plt.xlabel('amount of emails for each article')
plt.show()
In [21]:
series_grouped2 = df.groupby(['article_id', 'email'])['email'].count()
series_grouped2[:20]
Out[21]:
article_id  email                                   
0           2841916b462a2b89d36f4f95ca2d1f42559a5788    1
            384255292a8223e84f05ca1e1deaa450c993e148    3
            451a9a4a4cb1cc4e5f38d04e8859cc3fb275cc66    1
            74ca1ae8b034f7fad73a54d55fb1f58747f00493    1
            8bd0afc488016810c287ac4ec844895d570b0af4    1
            a60b7e945a8f2114d5dfbdd53182ad1d526534e2    1
            ad06c765d31179e56f309438367ecb30e1059620    1
            ca7d48adf2c7394ed5a8776de959fa8047e43d4b    1
            db8ac9b2f552db35750239ada8bfcb59b3ae48c0    1
            df722d3aac72766b93d4a65d8b4ac084a968d684    1
            e667c9a1cd56368dfa2f4b974ab2d848585552d7    1
            e6ed9e15addba353fe3c1f36d865a63fa254b9cc    1
2           0246d11c827f90850ce7062e9554c9d5eeb30027    1
            0286bfe26356436658cf4b29b232f0700f0bb9ce    2
            12815feeacc6f27dff5b3441a54418d2d51001ef    1
            12bb8a9740400ced27ae5a7d4c990ac3b7e3c77d    1
            15a1660b6450e064200f1272d9b3d049cf8cf5f1    1
            1d74fc07ef225ff993b9f80dfba85a6bd2bd55b8    1
            249d60fc4edda28cd8fd76f549ecc43259e07038    1
            26b8f921fac7a4d81f2749d64c10020491281545    1
Name: email, dtype: int64

Note: As this grouping shows, a single article can be linked to the same email several times, not just once.
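Those repeat interactions can be counted directly with `size()` per (article_id, email) pair; a small sketch with invented data:

```python
import pandas as pd

df_toy = pd.DataFrame({
    'article_id': [0, 0, 0, 2, 2],
    'email': ['x', 'x', 'y', 'x', 'z'],
})

# size() per (article, email) pair exposes repeat interactions directly.
pair_counts = df_toy.groupby(['article_id', 'email']).size()
repeats = pair_counts[pair_counts > 1]
print(len(repeats))              # 1 pair: 'x' interacted with article 0 twice
print(int(repeats.iloc[0]))     # that pair appears 2 times
```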

In [22]:
series_grouped3 = df.groupby(['email'])['article_id'].count()
series_grouped3
Out[22]:
email
0000b6387a0366322d7fbfc6434af145adf7fed1    13
001055fc0bb67f71e8fa17002342b256a30254cd     4
00148e4911c7e04eeff8def7bbbdaf1c59c2c621     3
001a852ecbd6cc12ab77a785efa137b2646505fe     6
001fc95b90da5c3cb12c501d201a915e4f093290     2
                                            ..
ffc6cfa435937ca0df967b44e9178439d04e3537     2
ffc96f8fbb35aac4cb0029332b0fc78e7766bb5d     4
ffe3d0543c9046d35c2ee3724ea9d774dff98a32    32
fff9fc3ec67bd18ed57a34ed1e67410942c4cd81    10
fffb93a166547448a0ff0232558118d59395fecd    13
Name: article_id, Length: 5148, dtype: int64
In [23]:
df.query("email == '0000b6387a0366322d7fbfc6434af145adf7fed1'")
Out[23]:
article_id title email
498 1314 healthcare python streaming application demo 0000b6387a0366322d7fbfc6434af145adf7fed1
599 732 rapidly build machine learning flows with dsx 0000b6387a0366322d7fbfc6434af145adf7fed1
627 173 10 must attend data science, ml and ai confere... 0000b6387a0366322d7fbfc6434af145adf7fed1
4635 1354 movie recommender system with spark machine le... 0000b6387a0366322d7fbfc6434af145adf7fed1
5689 43 deep learning with tensorflow course by big da... 0000b6387a0366322d7fbfc6434af145adf7fed1
5876 1232 country statistics: life expectancy at birth 0000b6387a0366322d7fbfc6434af145adf7fed1
7271 1162 analyze energy consumption in buildings 0000b6387a0366322d7fbfc6434af145adf7fed1
7363 124 python machine learning: scikit-learn tutorial 0000b6387a0366322d7fbfc6434af145adf7fed1
7372 1337 life expectancy at birth by country in total y... 0000b6387a0366322d7fbfc6434af145adf7fed1
8232 349 ibm data science experience white paper - spar... 0000b6387a0366322d7fbfc6434af145adf7fed1
10789 43 deep learning with tensorflow course by big da... 0000b6387a0366322d7fbfc6434af145adf7fed1
13865 288 this week in data science (january 31, 2017) 0000b6387a0366322d7fbfc6434af145adf7fed1
16542 618 can a.i. be taught to explain itself? 0000b6387a0366322d7fbfc6434af145adf7fed1
In [24]:
# visualise the user email - article count distribution
index_list = list(range(0, len(series_grouped3)))
interaction_dict = {
    'email': list(series_grouped3.index),
    'article_id_count': list(series_grouped3.values),
}
df_grouped_interact = pd.DataFrame(data=interaction_dict,
                                   columns=['email', 'article_id_count'], index=index_list)
df_grouped_interact = df_grouped_interact.nlargest(len(series_grouped3), 'article_id_count')
df_grouped_interact.head(40)
Out[24]:
email article_id_count
910 2b6c0f514c2f2b04ad3c4583407dccd0810469ee 364
2426 77959baaa9895a7e2bdc9297f8b27c1b6f2cb52a 363
985 2f5c7feae533ce046f2cb16fb3a29fe00528ed66 170
3312 a37adec71b667b297ed2440a9ff7dad427c7ac85 169
2680 8510a5010a5d4c89f5b07baac6de80cd12cfaf93 160
5005 f8c978bcf2ae2fb8885814a9b85ffef2f54c3c76 158
851 284d0c17905de71e209b376e3309c0b08134f7e2 148
525 18e7255ee311d4bd78f5993a9f09538e459e3fcc 147
4401 d9032ff68d0fd45dfd18c0c5f7324619bb55362c 147
832 276d9d8ca0bf52c780b5a3fc554fa69e74f934a3 145
4032 c60bb0a50c324dad0bffd8809d121246baef372b 145
1792 56832a697cb6dbce14700fca18cffcced367057f 144
3618 b2d2c70ed5de62cf8a1d4ded7dd141cfbbdd0388 142
4198 ceef2a24a2a82031246814b73e029edba51e8ea9 140
2867 8dc8d7ec2356b1b106eb3d723f3c234e03ab3f1e 137
4596 e38f123afecb40272ba4c47cb25c96a9533006fa 136
1733 53db7ac77dbb80d6f5c32ed5d19c1a8720078814 132
2187 6c14453c049b1ef4737b08d56c480419794f91c2 131
5101 fd824fc62b4753107e3db7704cd9e8a4a1c961f1 116
3992 c45f9495a76bf95d2633444817f1be8205ad542d 114
401 12bb8a9740400ced27ae5a7d4c990ac3b7e3c77d 104
1080 3427a5a4065625363e28ac8e85a57a9436010e9c 103
283 0d644205ecefdef33e3346bb3551f5e68dc57c58 102
1531 497935037e41a94d2ae02488d098c7abda9a30bc 102
30 015aaf617598e413a35d6d2249e26b7f3c40adb7 101
4719 e90de4b883d9de64a47774ad7ad49ca6fd69d4fe 101
4437 db1c400ffb74f14390deba2140bd31d2e1dc5c4e 98
2541 7dc02db8b76fffbdfe29542da672d4d5fd5ed4ae 97
959 2e205a44014ca7bdbf07fc32f3c9d17699671d03 96
3607 b2926913d95598ec0c007746d693fe3e466ff2d4 95
1354 4070b8d82484ed99cdb9bbc2ebf4e9aca06fd934 95
1457 463878695aac3acc71e9d7c18e7a3b5d8e1a5456 94
4891 f1ccb4d9d8446f26c6c8ee2a135782f984526860 94
2242 6f2a2814638cb70081ef84e149619eb3f4490f4f 92
2494 7b1389c3204f4205132973e68dfe2d20912df0f2 91
369 11304ae794b552e6c929654daaea245e5b57f03b 91
542 19c7d87e50dd9da96c7d2a980139df1497b94247 89
2099 665470e2d4eb76437965ec71e52b41d55f15a08d 89
4088 c9086fbe74843c4792d030260be1499c558edc03 85
453 157d0aba8d75f1c72e5428e4a64a51906008a43a 84
In [25]:
# Figure dimension (width, height) in inches
df_grouped_interact.plot.barh(x='email', y='article_id_count', figsize=[10,900])
plt.title("User Article Interaction Distribution")
plt.ylabel('email label')
plt.show()

As expected, the distribution shows that most users interact with very few articles (typically fewer than 5). Only a few users have a high interaction count (>100), and only 2 exceed 300. The descriptive statistics below give a better picture.

Both distributions (emails per article and articles per email) show that the dataset is imbalanced.

In [26]:
df_grouped_interact.describe()
Out[26]:
article_id_count
count 5148.000000
mean 8.930847
std 16.802267
min 1.000000
25% 1.000000
50% 3.000000
75% 9.000000
max 364.000000

Another question is: how many different article titles are there? This value should match the number of unique article id's, which is 714, so that we can be reasonably sure there is a valid 1:1 mapping between article id and title.

Afterwards, a diagram that is easier to read than the ones above is plotted, showing the distribution of titles per email (emails stand in for users for now). It makes the median number of user-article interactions much easier to see.

In [27]:
len(df['title'].unique())
Out[27]:
714
In [28]:
# create a new plot that is easier to read than the former ones: arrange the bins
# to concentrate on the skewed part, truncate the long empty tail up to 364 (range up to 65),
# and set the y axis ticks according to the bar heights
email_title_counts = df.groupby(['email'])['title'].count()
fig, ax = plt.subplots(figsize=(10, 6))
ax.hist(email_title_counts, bins=50, range=(1, 65))
ax.set_xticks(np.arange(0, 65, 5))
ax.set_yticks(np.arange(0, 2400, 200))
# use only the y axis grid lines to get a better visibility
ax.grid(which='major', axis='x')
plt.title("Title User Interaction Distribution")
plt.xlabel('amount of titles')
plt.ylabel('amount of users');

Note:
Regarding the next task - fill in the median and maximum number of user-article interactions below - and given the investigation of emails and articles/titles above, it is unclear whether each email maps to a different user; we cannot rule out that several emails belong to the same user. For each article id there is a list of emails with mostly distinct string values.

Neither dataframe includes a list of user id's, which would help to distinguish such cases.

So, as a simplification, for the next task we assume that each distinct email string maps to a different user. Additionally, we count every row as an interaction with an article, regardless of whether it comes from the same user as a previous row.
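Under that simplification, distinct email strings can be turned into integer user ids; one way (an illustrative sketch, not necessarily how the project's grading code does it) is `pd.factorize`, with made-up emails:

```python
import pandas as pd

emails = pd.Series(['a@x', 'b@x', 'a@x', None, 'b@x'])  # invented values

# factorize assigns one integer per distinct email string; None becomes -1.
codes, uniques = pd.factorize(emails)
user_ids = codes + 1              # shift so real users start at 1, missing -> 0
print(list(user_ids))             # [1, 2, 1, 0, 2]
```

The null-email rows stand out as user id 0 and could be handled separately, mirroring the 17 null emails found in df above.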

In [29]:
# Fill in the median and maximum number of user-article interactions below
median_val = int(df.groupby(['email'])['article_id'].count().median())
print("50% of individuals interact with {} articles or fewer.".format(median_val))

max_views_by_user = df.groupby(['email'])['article_id'].count().max()
print("The maximum number of user-article interactions by any 1 user is: {}.".format(max_views_by_user))
50% of individuals interact with 3 articles or fewer.
The maximum number of user-article interactions by any 1 user is: 364.

2. Explore and remove duplicate articles from the df_content dataframe.

In [30]:
sum(df_content.duplicated(subset='article_id'))
Out[30]:
5
In [31]:
# Find and explore duplicate articles
# see: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.duplicated.html?highlight=duplicated#pandas.DataFrame.duplicated
# as default 'first' is set: Mark duplicates as True except for the first occurrence.
article_dup_idx = df_content.index[df_content.duplicated(subset='article_id')].tolist()
article_dup_idx
Out[31]:
[365, 692, 761, 970, 971]
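Once explored, the duplicate articles can be dropped by article_id; a minimal sketch with invented rows showing how `keep='first'` lines up with the default of `duplicated()` above:

```python
import pandas as pd

toy_content = pd.DataFrame({
    'article_id': [50, 50, 51],                                   # made-up ids
    'doc_full_name': ['Graph ML part 2', 'Graph ML repost', 'Other'],
})

# keep='first' retains the first row per article_id, matching duplicated()'s default.
cleaned = toy_content.drop_duplicates(subset='article_id', keep='first')
print(len(cleaned))                          # 2 rows survive
print(cleaned['doc_full_name'].iloc[0])      # the first occurrence is kept
```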
In [32]:
# to show the whole string content and not a truncated one
pd.set_option('display.max_colwidth', None)

print("Information about the duplicated articles ...\n")
for idx in article_dup_idx:
    print("\033[1mDuplicate: {}\033[0m".format(idx))
    print("\033[1mArticle ID: {}\033[0m".format(df_content.iloc[idx]['article_id']))
    print(df_content.iloc[idx])
Information about the duplicated articles ...

Duplicate: 365
Article ID: 50
doc_body           Follow Sign in / Sign up Home About Insight Data Science Data Engineering Health Data AI 5 * Share\r\n * 5\r\n * \r\n * \r\n\r\nNever miss a story from Insight Data , when you sign up for Medium. Learn more Never miss a story from Insight Data Get updates Get updates Sebastien Dery Blocked Unblock Follow Following Master of Layers, Protector of the Graph, Wielder of Knowledge. #OpenScience\r\n#NoBullshit 2 days ago\r\n--------------------------------------------------------------------------------\r\n\r\nGRAPH-BASED MACHINE LEARNING: PART 2\r\nCOMMUNITY DETECTION AT SCALE\r\nDuring the seven-week Insight Data Engineering Fellows Program recent grads and experienced software engineers learn the latest open source technologies by building a data platform to handle large, real-time datasets.\r\n\r\nSebastien Dery (now a Data Science Engineer at Yewno ) discusses his project on community detection on large datasets.\r\n\r\n\r\n--------------------------------------------------------------------------------\r\n\r\n#tltr : Graph-based machine learning is a powerful tool that can easily be merged\r\ninto ongoing efforts. This work reviews the feasibility of performing community\r\ndetection through a distributed implementation using GraphX. Embedded within the\r\nHadoop ecosystem, this modularity optimization approach allows the study of\r\nnetworks of unprecedented size. This change of scales, previously limited by\r\nRAM, opens exciting perspectives as the self modular structure of complex\r\nsystems have been shown to hold crucial information to understanding their\r\nnature.In my previous post , we discussed the foundation of community detection using modularity\r\noptimization. One major constraint however, is that your graph needs to fit in\r\nmemory. 
This quickly turns problematic as your number of nodes surpass billions, and\r\nthe number of edges becomes trillions.\r\n\r\nThankfully we can leverage distributed computation systems in order to solve this limitation. To do this we first need to define the state\r\nof a node so that it contains all the information needed during computation;\r\nthis will serve as a basic structure to pass around between the machines of our\r\ndistributed cluster.\r\n\r\n“Node” and “Vertex” are often used interchangeably in the literature. This class\r\nserves as structure for the nodes within the graph.Let’s also briefly review the process behind modularity optimization. This works\r\nby iteratively merging nodes that optimize for local modularity to yield a new, and smaller, graph. Repeat until satisfied.\r\n\r\nTwo great properties emerge from this approach\r\n\r\n 1. Locality : Each node requires knowledge from only its first-degree neighbors. This\r\n    means a minimal amount of data needs to be passed around between clusters.\r\n    This way, you don’t need to extensively jump from node to node across the\r\n    clusters in order to get the necessary information.\r\n 2. Independence : Each local computation occurs independently of the graph layout. Within\r\n    an iteration, every node can asynchronously send its information to its\r\n    neighbors without waiting for a blocking sequential set of operations to\r\n    happen.\r\n\r\nThese are important points to highlight as they make distributed computation a\r\nprime candidate for this memory problem. Turns out we can easily implement the\r\nlogic behind those properties using nothing but a simple iteration and a\r\ndeveloper-defined halting criteria. 
As previously discussed this can take many forms; here are a few ideas for brainstorming:\r\n\r\n * Scheduled based on a predefined number of iterations\r\n * Hits a specific total number of communities\r\n * Modularity gain between iteration is below a threshold\r\n\r\nSimple iteration over the two stage process of our optimization: transfer and\r\ncompress.Let’s dive into the initial step of transferring community between nodes.\r\nRemember that each node needs the information from its neighbors in order to\r\ncompute the gradient for local modularity.\r\n\r\nTRANSFER\r\nThe best way to do this at scale (when you don’t know where the information\r\nultimately is on disk) is by using distributed transactions (aka passing messages ). This type of architecture is ubiquitous in modern computer software; it is\r\nused as a way for the objects that make up a program to work with each other and\r\nas a way for objects and systems running on different computers (e.g., the\r\nInternet) to interact. In algorithms, you’ll often find it referenced under the\r\nname of Belief Propagation or simply message passing . In the context of community detection, each node sends a message to its\r\nneighbors with content along the lines of:\r\n\r\n“ Hey I’m your friendly neighbor Node 3 from Community 12 ”\r\n\r\nBy independently sending messages to their first degree neighbors, each node can\r\nretrieve all the information necessary for them to optimize for local\r\nmodularity. The content of each message can easily be tweaked thus adding\r\nconsiderable flexibility to your approach.If you’ve ever worked with graphs you’re likely to be very familiar with the\r\nconcepts of vertices and edges . Should we perform the message passing exhaustively you’d basically go through\r\neach vertex and send a message for each of its edges. This is not an\r\nintrinsically bad approach if that’s all you have to work with. 
Turns out that\r\nin the world of GraphX we have access to a third primitive for easy manipulation of our data: the triplet .\r\n\r\nThe three different types of view allowed within GraphX. Taken from AMPLab .The triplet logically joins the vertex and edge properties for a simplified and\r\nuseful view. Literally, the EdgeTriplet class extends the Edge class by simply adding the srcAttr and dstAttr members containing the source and destination properties respectively.\r\n\r\nBy reducing the triplets view, each node receives N messages corresponding to its N first degree neighbors. sendMsg and mergeMsg are both internal functions which perform the necessary aggregation for the\r\nlocal modularity update. Independently, and in parallel, each node waits for its\r\nturn to reduce all its messages into a coherent local sum of weighted edges, and\r\nmake a decision based on the local modularity deltaQ of each neighboring community.\r\n\r\nA few iterations later, the graph has converged to a local equilibrium (e.g. a\r\nminimal amount of nodes feel the need to change community). The algorithm can\r\nnow progress to the next step of compressing those communities into a compact\r\nrepresentation. This is done by creating a new graph with a new set of nodes\r\n(corresponding to each community) and edges being inferred from the edges during\r\nthe previous computation (e.g. average or sum of external edges).\r\n\r\nCOMPRESSION\r\nWhat function to choose really depends on the use case (e.g. averaging, total\r\nsum, maximum, softmax , etc. are all valid functions, although their respective advantages remains\r\nunclear in this particular scenario). 
When in doubt, let’s use a simple average.\r\nNote that additional information, say the internal coherence within a community,\r\ncan be propagated in a similar fashion to the condensed node and provide\r\nvaluable information.\r\n\r\nEffect of compressing community into single nodes at each iteration.Finally, here we have a fully functional procedure to perform modularity\r\noptimization on graphs of ridiculously large size, assuming we have enough\r\ncomputers to store all the information on disk.\r\n\r\nCAVEATS AND TIPS\r\nCOMPUTATION TIME\r\nNote that the number of meta-communities naturally decreases at each pass, and\r\nas a consequence most of the computing time is used in the first pass. This\r\nsuggests pre-ordering of the data would hold considerable benefit in terms of\r\ncomputation time.\r\n\r\nOptimizing for node locality at the cluster level means less transfer between\r\nmachines.CONVERGENCE\r\nThis approach does not necessarily converge to the optimal solution . To improve this, multiple iterations can increase confidence over the\r\nstructure of your data. Conveniently, this also offers a proxy for the\r\nprobability of two nodes belonging to the same community.\r\n\r\nLAYOUT\r\nTake into account graph connectivity when determining the usefulness of this\r\nstrategy. For example, for a completely connected and unweighted graph, the\r\noutput will be degenerate. Consider thresholding the graph beforehand to extract\r\na more sparse representation of your data.\r\n\r\nThe adequateness of modularity optimization is dependent on the connectivity\r\npattern of your graph. For example, in a lattice layout this algorithm will\r\nperform rather poorly. 
Modularity optimization doesn’t guarantee adequate\r\nclustering; thus obtaining a community at the end is not enough to conclusively\r\nsay a node decidedly belongs to that group (or even any group, for that matter).HIERARCHY\r\nThe iterative nature of this process offers a hierarchical view between\r\ncommunities of subsequent iteration. The intermediary step should therefore be\r\nsaved for further investigation as they likely yield valuable information on the\r\nstructural complexity of the data. This saving procedure is not covered in this\r\npost but should be trivial to introduce (insert configuration state into your\r\nfavorite database) between iteration.\r\n\r\nSUMMARY\r\nThis work reviewed the feasibility of performing community detection through a\r\ndistributed implementation using GraphX . Embedded within the Hadoop ecosystem , this modularity optimization approach allows the study of networks of\r\nunprecedented size. This change of scales, previously limited by RAM, opens\r\nexciting perspectives as the self modular structure of complex systems have been\r\nshown to hold crucial information to understanding their nature. This enables,\r\namong others, targeted marketing , market segmentation , gene clustering , topic modeling , etc.\r\n\r\nBeing an unsupervised learning technique and an initial starting point for a lot\r\nof analysis, the low barriers of entry make this approach applicable to a wide\r\nrange of datasets.\r\n\r\nDid I miss something crucial to get you up and running? Have something to add?\r\nWould love to hear your experience with this type of approach!\r\n\r\n\r\n--------------------------------------------------------------------------------\r\n\r\nWant to learn Spark, machine learning with graphs, and other big data tools from\r\ntop data engineers in Silicon Valley or New York? 
The Insight Data Engineering Fellows Program is a free 7-week professional training where you can build cutting-edge big data platforms and transition to a career in data engineering.

Learn more about the program and apply today.

SEBASTIEN DERY
Master of Layers, Protector of the Graph, Wielder of Knowledge. #OpenScience #NoBullshit

INSIGHT DATA
Insight Fellows Program — Your bridge to careers in Data Science and Data Engineering.
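The article above repeatedly refers to the modularity score being optimized without restating it. As a toy illustration (the graph, labels, and `modularity` helper below are our own example built with numpy, not part of the article or of this notebook's data), Newman's modularity Q = (1/2m) Σᵢⱼ (Aᵢⱼ − kᵢkⱼ/2m) δ(cᵢ, cⱼ) can be computed directly:

```python
import numpy as np

# Adjacency matrix of two triangles joined by a single bridge edge:
# nodes 0-2 form one natural community, nodes 3-5 the other.
A = np.array([
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
], dtype=float)

def modularity(A, labels):
    """Newman's modularity: Q = (1/2m) * sum_ij (A_ij - k_i k_j / 2m) * delta(c_i, c_j)."""
    k = A.sum(axis=1)        # node degrees
    two_m = A.sum()          # twice the number of edges
    same = labels[:, None] == labels[None, :]   # delta(c_i, c_j)
    return ((A - np.outer(k, k) / two_m) * same).sum() / two_m

good = np.array([0, 0, 0, 1, 1, 1])   # the two triangles
bad = np.array([0, 1, 0, 1, 0, 1])    # an arbitrary split

print(modularity(A, good))   # higher score for the natural split
print(modularity(A, bad))    # negative score for the arbitrary one
```

For this graph the natural split scores Q = 5/14 ≈ 0.357 while the arbitrary split scores below zero, which is why greedy modularity optimization gravitates toward the former; the article's convergence caveat is that such greedy moves only find a local maximum of this score.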
doc_description    During the seven-week Insight Data Engineering Fellows Program recent grads and experienced software engineers learn the latest open source technologies by building a data platform to handle large…
doc_full_name      Graph-based machine learning
doc_status         Live
article_id
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              50
Name: 365, dtype: object
Duplicate: 692
Article ID: 221
doc_body           Homepage Follow Sign in / Sign up Homepage * Home\r\n * Data Science Experience\r\n * Data Catalog\r\n * \r\n * Watson Data Platform\r\n * \r\n\r\nSusanna Tai Blocked Unblock Follow Following Offering Manager, Watson Data Platform | Data Catalog Oct 30\r\n--------------------------------------------------------------------------------\r\n\r\nHOW SMART CATALOGS CAN TURN THE BIG DATA FLOOD INTO AN OCEAN OF OPPORTUNITY\r\nOne of the earliest documented catalogs was compiled at the great library of\r\nAlexandria in the third century BC, to help scholars manage, understand and\r\naccess its vast collection of literature. While that cataloging process\r\nrepresented a massive undertaking for the Alexandrian librarians, it pales in\r\ncomparison to the task of wrangling the volume and variety of data that modern\r\norganizations generate.\r\n\r\nNowadays, data is often described as an organization’s most valuable asset, but\r\nunless users can easily sift through data artifacts to find the information they\r\nneed, the value of that data may remain unrealized. Catalogs can solve this\r\nproblem by providing an indexed set of information about the organization’s\r\ndata, storing metadata that describes all assets and providing a reference to\r\nwhere they can be found or accessed.\r\n\r\nIt’s not just the size and complexity of the data that makes cataloging a tough\r\nchallenge: organizations also need to be able to perform increasingly\r\ncomplicated operations on that data at high speed, and even in real-time. As a\r\nresult, technology leaders must continually find better ways to solve today’s\r\nversion of the same cataloging challenges faced in Alexandria all those years\r\nago.\r\n\r\nENTER IBM\r\nIBM’s aim with Watson Data Platform is to make data accessible for anyone who uses it. 
An integral part of Watson\r\nData Platform will be a new intelligent asset catalog, IBM Data Catalog , a solution underpinned by a central repository of metadata describing all the\r\ninformation managed by the platform. Unlike many other catalog solutions on the\r\nmarket, the intelligent asset catalog will also offer full end-to-end\r\ncapabilities around data lifecycle and governance.\r\n\r\nBecause all the elements of Watson Data Platform can utilize the same catalog,\r\nusers will be able to share data with their colleagues more easily, regardless\r\nof what the data is, where it is stored, or how they intend to use it. In this\r\nway, the intelligent asset catalog will unlock the value held within that data\r\nacross user groups — helping organizations use this key asset to its full\r\npotential.\r\n\r\nBREAKING DOWN SILOS\r\nWith Watson Data Platform, data engineers, data scientists and other knowledge\r\nworkers throughout an enterprise can search for, share and leverage assets\r\n(including datasets, files, connections, notebooks, data flows, models and\r\nmore). Assets can be accessed using the Data Science Experience web user interface to analyze data,\r\n\r\nTo collaborate with colleagues, users can put assets into a Project that acts as\r\na shared sandbox where the whole team can access and utilize them. Once their\r\nwork is complete, they can submit any resulting content to the catalog for\r\nfurther reuse by other people and groups across the organization.\r\n\r\nRich metadata about each asset makes it easy for knowledge workers to find and\r\naccess relevant resources. 
Along with data files, the catalog can also include\r\nconnections to databases and other data sources, both on- and off-premises,\r\ngiving users a full 360-degree view to all information relevant to their\r\nbusiness, regardless of where or how it is stored.\r\n\r\nMANAGING DATA OVER TIME\r\nIt’s important to look at data as an evolving asset, rather than something that\r\nstays fixed over time. To help manage and trace this evolution, IBM Data Catalog\r\nwill keep a complete track of which users have added or modified each asset, so\r\nthat it is always clear who is responsible for any changes.\r\n\r\nSMART CATALOG CAPABILITIES FOR BIG DATA MANAGEMENT\r\nThe concept of catalogs may be simple, but when they’re being used to make sense\r\nof huge amounts of constantly changing data, smart capabilities make all the\r\ndifference. Here are some of the key smart catalog functionalities that we see\r\nas integral to tackling the big data challenge.\r\n\r\nDATA AND ASSET TYPE AWARENESS\r\nWhen a user chooses to preview or view an asset of a particular type, the data\r\nand asset type awareness feature will automatically launch the data in the best\r\nviewer — such as a shaper for a dataset, or a canvas for a data flow. This will\r\nsave time and boost productivity for users, optimizing discovery and making it\r\neasier to work with a variety of data types without switching tools.\r\n\r\nINTELLIGENT SEARCH AND EXPLORATION\r\nBy combining metadata, machine learning-based algorithms and user interaction\r\ndata, it is possible to fine-tune search results over time. Presenting users\r\nwith the most relevant data for their purpose will increase usefulness of the\r\nsolution the more it is used.\r\n\r\nSOCIAL CURATION\r\nEffective use of data throughout your organization is a two-way street: when\r\nusers discover a useful dataset, it’s important for them to help others find it\r\ntoo. 
Users can be encouraged to engage by taking advantage of curation features,\r\nenabling them to tag, rank and comment on assets within the catalog. By\r\naugmenting the metadata for each asset, this can help the catalog’s intelligent\r\nsearch algorithms guide users to the assets that are most relevant to their\r\nneeds.\r\n\r\nDATA LINEAGE\r\nIf data is incomplete or inaccurate, utilizing it can cause more problems than\r\nit solves. On the other hand, if data is accurate but users do not trust it,\r\nthey might not use it when it could make a real difference. In either scenario,\r\ndata lineage can help.\r\n\r\nData lineage captures the complete history of an asset in the catalog: from its\r\noriginal source, through all the operations and transformations it has\r\nundergone, to its current state. By exploring this lineage, users can be\r\nconfident they know where assets have come from, how those assets have evolved,\r\nand whether they can be trusted.\r\n\r\nMONITORING\r\nTaking a step back to a higher-level view, monitoring features will help users\r\nkeep track of overall usage of the catalog. Real-time dashboards help chief data\r\nofficers and other data professionals monitor how data is being used, and\r\nidentify ways to increase its usage in different areas of the organization.\r\n\r\nMETADATA DISCOVERY\r\nWe have already mentioned that data needs to be seen as an evolving asset —\r\nwhich means our catalogs must evolve with it. We plan to make it easy for users\r\nto augment assets with metadata manually; in the future, it may also be possible\r\nto integrate algorithms that can discover assets and capture their metadata\r\nautomatically.\r\n\r\nDATA GOVERNANCE\r\nFor many organizations, keeping data secure while ensuring access for authorized\r\nusers is one of the most significant information management challenges. 
You can\r\nmitigate this challenge with rule-based access control and automatic enforcement\r\nof data governance policies.\r\n\r\nAPIS\r\nFinally, the catalog will enable access to all these capabilities and more\r\nthrough a set of well-defined, RESTful APIs. IBM is committed to offering\r\napplication developers easy access to additional components of Watson Data Platform , such as persistence stores and data sets. We hope that they can use our\r\nservices to extend their current suite of data and analytics tools, to innovate\r\nand create smart new ways of working with the data.\r\n\r\nLearn more about IBM Data Catalog\r\n\r\n\r\n--------------------------------------------------------------------------------\r\n\r\nWritten by Jay Limburn\r\nDistinguished Engineer and Offering Lead, Watson Data Platform\r\n\r\nOriginally published at www.ibm.com on August 1, 2017\r\n\r\n * Data Catalog\r\n * Data Management\r\n * Data Analytics\r\n * IBM\r\n * Ibm Watson\r\n\r\nA single golf clap? Or a long standing ovation?By clapping more or less, you can signal to us which stories really stand out.\r\n\r\nBlocked Unblock Follow FollowingSUSANNA TAI\r\nOffering Manager, Watson Data Platform | Data Catalog\r\n\r\nFollowIBM WATSON DATA PLATFORM\r\nBuild smarter applications and quickly visualize, share, and gain insights\r\n\r\n * \r\n * \r\n * \r\n * \r\n\r\nNever miss a story from IBM Watson Data Platform , when you sign up for Medium. Learn more Never miss a story from IBM Watson Data Platform Get updates Get updates
doc_description                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            One of the earliest documented catalogs was compiled at the great library of Alexandria in the third century BC, to help scholars manage, understand and access its vast collection of literature…
doc_full_name                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   How smart catalogs can turn the big data flood into an ocean of opportunity
doc_status                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Live
article_id                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           221
Name: 692, dtype: object
Duplicate: 761
Article ID: 398
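The report above flags row 761 as a repeat of an `article_id` already seen in `df_content`. A minimal sketch (using a toy frame in place of the real CSV, so column values here are illustrative) of how such duplicates might be flagged with pandas and then dropped, keeping the first occurrence:

```python
import pandas as pd

# Toy stand-in for df_content; the real frame is read from
# data/articles_community.csv earlier in the notebook.
df_content = pd.DataFrame({
    'article_id': [221, 398, 398],
    'doc_full_name': ['How smart catalogs ...',
                      'Using Apache Spark ...',
                      'Using Apache Spark ...'],
})

# Mark every row whose article_id repeats an earlier row
dup_mask = df_content.duplicated(subset='article_id', keep='first')
for idx in df_content.index[dup_mask]:
    print("Duplicate: {}".format(idx))
    print("Article ID: {}".format(df_content.loc[idx, 'article_id']))

# Remove the repeats, keeping the first occurrence of each article
df_content = df_content.drop_duplicates(subset='article_id', keep='first')
```

`keep='first'` mirrors the behavior implied by the printout: the earlier row survives and only later repeats are reported and removed.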
doc_body           Homepage Follow Sign in Get started Homepage * Home\r\n * Data Science Experience\r\n * Data Catalog\r\n * IBM Data Refinery\r\n * \r\n * Watson Data Platform\r\n * \r\n\r\nSourav Mazumder Blocked Unblock Follow Following Nov 27\r\n--------------------------------------------------------------------------------\r\n\r\nUSING APACHE SPARK AS A PARALLEL PROCESSING FRAMEWORK FOR ACCESSING REST BASED\r\nDATA SERVICES\r\nToday’s world of data science leverages data from various sources. Commonly,\r\nthese sources are Hadoop File System, Enterprise Data Warehouse, Relational\r\nDatabase systems, Enterprise file systems, etc. The data from these sources are\r\naccessed in bulk using connectors specific to the underlying technology and\r\noptimized for accessing large volume of data.\r\n\r\nHowever, many a times, a data science exploration/modeling exercise also needs\r\nto access data from sources that support only API-based data access. These\r\nAPI-based data sources/data services can be of various types. For example:\r\n\r\n * Data services (external or internal), which can provide curated/enriched data\r\n   in record-by-record manner.\r\n * Validation services for verifying the data using an API. For example Address\r\n   validation.\r\n * Machine learning/AI services, which provide prediction, recommendations, and\r\n   insights based on a single input record.\r\n * Service from internal systems (like CRM, MDM, etc.) of the organization,\r\n   which supports data access through API only in record-by-record manner.\r\n * And many more …\r\n\r\nThese API-based data services are commonly implemented using REST architectural\r\nstyle ( https://en.wikipedia.org/wiki/Representational_state_transfer ) and are designed to be called for single item (or a limited set of items) per\r\nrequest. While this works well when the API needs to be called from an online\r\napplication, the approach breaks down in situations when the API has to be\r\ncalled in bulk. 
For example, during an online sign-up process an address\r\nvalidation API can be called for the particular address of the user. But, say in\r\na health care analytics application, where addresses of thousands of doctors,\r\nwhich already exist in a database or were obtained as part of a bulk load from\r\nan external source, have to be verified, this approach will not work. Because of\r\nthe “single item per request” design of the API, you’d have to call the API\r\nthousands of times.\r\n\r\nCalling data service APIs in sequence — Processing Time = (# of Records)*(API\r\nresponse time)\r\n--------------------------------------------------------------------------------\r\n\r\nThe above pseudo code snippet shows how calling a target REST API service is\r\nhandled in a sequential manner. You must first load the list of parameter values\r\nfrom a file or table in the memory. Next run a loop. In the loop, the target\r\nREST API has to be called for each set of parameter values. From the response\r\nreturned by each call the output must be extracted. The output is typically\r\npopulated in a complex object like JSON, XML, etc. Next, the necessary part of\r\nthe output has to be added to a result array or collection. For that, you must\r\nknow the schema of the result beforehand so that you can process the result\r\naccordingly. Finally, you can filter, exploring, aggregating data from the\r\nresult array or collection. For all of these steps, you have to use\r\nlanguage-specific complex code.\r\n\r\nAlternatively, you could use a programming language-specific library related to\r\nmulti-processing/multi-threading that can parallelize the call to the API.\r\nHowever, with that approach the parallelization achieved from a single machine\r\nwould be minuscule — limited to the number of cores of the machine. Consider, a\r\ncase where someone is trying to get personality insights from tweets or Facebook\r\ncomments using a Natural Language Processing service. 
The tweets and comments\r\ncan be in tens to hundreds of thousands. So, using a single machine could take a\r\nnumber of hours to get the result. Hence, the approach should be to use a\r\ndistributed processing framework to make the API calls parallelized using\r\nmultiple cores of multiple machines with the least coding effort. Though it is\r\npossible to get distributed computing libraries or frameworks to achieve the\r\nsame in some programming languages like Java, C++ etc., they require a\r\nreasonable amount of coding and setup to achieve the same result. Achieving this\r\nin popular data science languages, like R or Python is actually more difficult\r\nas they are originally designed to run in single threaded/single machine\r\nenvironment.\r\n\r\nHere enters distributed computing frameworks like Apache Spark ( https://spark.apache.org/ ). REST APIs are inherently conducive to parallelization as each call to the\r\nAPI is completely independent of any other call to the same API. This fact, in\r\nconjunction with the parallel computing capability of Spark, can be leveraged to\r\ncreate a solution that solves the problem by delegating the API call to Spark’s\r\nparallel workers. Under this approach, one can package a specification for how\r\nto call the API along with the input data, and pass that to Spark to divide the\r\neffort among its workers (and tasks). The output can be assembled in set-level\r\nabstractions supported by Spark (like dataframes or data sets ) and passed back to the calling program. This approach not only helps you turn a\r\nsequential execution into a parallel one with the least coding effort, but also\r\nmakes it much easier to analyze and transform the returned result with an easier\r\ndata abstraction model to work with.\r\n\r\nThe performance benefit you gets is tremendous in this approach. 
This turns a\r\nproblem that takes incremental time for computation (that increases linearly\r\nwith the number of records to process), to one that is much more efficient and\r\nscales linearly on a much lower slope — number of records to process divided by\r\nthe number of cores available to process them. Theoretically, one can make the\r\nprocess constant time by having enough cores to process ALL of the records at\r\nonce.\r\n\r\nTo enable the benefits of using Spark to call REST APIs, we are introducing a\r\ncustom data source for Spark, namely REST Data Source. It has been built by\r\nextending Spark’s Data Source API. This helps in delegating calls to the target\r\nREST API to a Spark level Task for each set of input parameter values/record.\r\nThis also enables the results from multiple API calls to be returned as one\r\nSpark Dataframe. The REST Data Source expects the input to be in the format of a\r\nSpark Temporary table. The results from the API calls are returned in a single\r\nDataframe of Rows including the input parameters in their corresponding column\r\nnames, as well as the output from the REST call in a structure matching that of\r\nthe target API’s response. You can check the schema of this Dataframe, and\r\naccess the result as necessary using Spark SQL.\r\n\r\nThe architecture of REST Data SourceThe above figure shows how REST Data Source works.\r\n\r\n 1. You first read different sets of parameter values (that have to be sent to\r\n    target REST API) from a file/table to a Spark Dataframe (say Input Data\r\n    Frame).\r\n 2. Then the Input Data Frame is passed to the REST Data Source.\r\n 3. The REST Data Source returns the results to another Dataframe, say Result\r\n    Data Frame.\r\n 4. 
Now you can use Spark SQL to explore, aggregate, and filter the result using\r\n    the Result Data Frame.\r\n\r\nREST Data Source internally calls the target REST API in parallel by executing\r\nmultiple tasks spawned by multiple worker processes running in different\r\nmachines. Each task is responsible for calling the target REST API Service for a\r\npart of the input (part of sets of parameter values).\r\n\r\nThe code snippet below demonstrates how to use REST Data Source in Python to get\r\nresults from Socrata Data Service (SODA API) for multiple sets of parameter\r\nvalues by calling the appropriate REST API in parallel.\r\n\r\nA sample code snippet showing use of REST Data Source to call REST API in\r\nparallelYou can configure the REST Data Source for different levels of parallelization.\r\nDepending on the volume of input sets of parameter values to be processed and\r\nthroughput supported by the target REST API server, you can pass the number of\r\npartitions to be used, and that can limit or extend the level of parallelization\r\nas needed. You can use this framework in all programming languages supported by\r\nSpark — Python, Scala, R, or Java — without any additional coding specific to\r\nthat programming language. Last, but not the least, you can also use this\r\nframework to ensure that the target API is called only once for a given set of\r\nparameter values. In this way you can avoid calling the target REST API multiple\r\ntimes for same set of parameter values. This is especially useful when you must\r\npay for the REST API being called or there is a limit per day for the same.\r\n\r\nSee \r\nhttps://github.com/sourav-mazumder/Data-Science-Extensions/tree/master/spark-datasource-rest for details of the REST Data Source. 
Also see this notebook https://dataplatform.ibm.com/analytics/notebooks/ae63f056-e267-443e-bfc0-b9331f51d68a/view?access_token=0ec63c6e031aa57d065a4e1c4b71733729db43b1490c331a44323cce28725b7d for an example of how to use the REST Data Source.

 * Big Data
 * Spark
 * Artificial Intelligence
 * Data Science
 * Rest Api

Sourav Mazumder, IBM Watson Data Platform
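The article's closing point, calling the target API only once per distinct set of parameter values, amounts to deduplicating the inputs before dispatching the calls. A minimal sketch, again with a hypothetical `fetch` standing in for a pay-per-call or rate-limited REST API:

```python
def fetch(params):
    # Hypothetical stand-in for an expensive REST call; the counter lets
    # us verify how many times the "API" was actually hit.
    fetch.calls += 1
    return {"input": params, "result": params["q"].upper()}
fetch.calls = 0

def call_once_per_distinct(param_sets):
    # Cache results keyed by the parameter values, so each distinct
    # parameter set triggers exactly one call to the target API.
    cache = {}
    rows = []
    for params in param_sets:
        key = tuple(sorted(params.items()))
        if key not in cache:
            cache[key] = fetch(params)
        rows.append(cache[key])
    return rows

inputs = [{"q": "spark"}, {"q": "rest"}, {"q": "spark"}]
rows = call_once_per_distinct(inputs)
print(len(rows), fetch.calls)  # 3 result rows, but only 2 API calls
```

Every input still gets a result row, yet the duplicate parameter set is served from the cache rather than a second billable request.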
doc_description    Today’s world of data science leverages data from various sources. Commonly, these sources are Hadoop File System, Enterprise Data Warehouse, Relational Database systems, Enterprise file systems, etc…
doc_full_name      Using Apache Spark as a parallel processing framework for accessing REST based data services
doc_status         Live
article_id
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                    398
Name: 761, dtype: object
Duplicate: 970
Article ID: 577
doc_body           This video shows you how to construct queries to access the primary index through the API.Visit http://www.cloudant.com/sign-up to sign up for a free Cloudant account.
doc_description                                                                                  This video shows you how to construct queries to access the primary index through the API
doc_full_name                                                                                                                                                        Use the Primary Index
doc_status                                                                                                                                                                            Live
article_id                                                                                                                                                                             577
Name: 970, dtype: object
Duplicate: 971
Article ID: 232
doc_body           Homepage Follow Sign in Get started * Home\r\n * Data Science Experience\r\n * Data Catalog\r\n * IBM Data Refinery ... [full text of the Medium post, truncated for readability]
doc_description    If you are like most data scientists, you are probably spending a lot of time to cleanse, shape and prepare your data before you can actually start with the more enjoyable part of building and…
doc_full_name      Self-service data preparation with IBM Data Refinery
doc_status         Live
article_id         232
Name: 971, dtype: object

As investigated above, there are five duplicate articles, which can be removed from this dataset. By default, `drop_duplicates` does not mark the first occurrence of an article as a duplicate, so the first occurrence remains in the dataset.

In [33]:
# Remove any rows that have the same article_id - only keep the first
df_content.drop_duplicates(subset='article_id', inplace=True)
df_content.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1051 entries, 0 to 1055
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   doc_body         1037 non-null   object
 1   doc_description  1048 non-null   object
 2   doc_full_name    1051 non-null   object
 3   doc_status       1051 non-null   object
 4   article_id       1051 non-null   int64 
dtypes: int64(1), object(4)
memory usage: 49.3+ KB
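The `keep='first'` behaviour of `drop_duplicates` can be sketched on a small toy frame (hypothetical data, not the real dataset):

```python
import pandas as pd

# toy frame with duplicated article ids (hypothetical, for illustration only)
toy = pd.DataFrame({
    'article_id': [0, 1, 1, 2, 2],
    'doc_full_name': ['a', 'b', 'b copy', 'c', 'c copy'],
})

# keep='first' (the default) retains the first occurrence of each id
deduped = toy.drop_duplicates(subset='article_id')
print(deduped['doc_full_name'].tolist())  # ['a', 'b', 'c']
```

The later copies (`'b copy'`, `'c copy'`) are dropped, mirroring how the five duplicate articles were removed above.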
In [34]:
# inspect the raw string content; are there icons as well? => yes, the second item contains some emoticons
df_content.head(2)
Out[34]:
doc_body doc_description doc_full_name doc_status article_id
0 Skip navigation Sign in SearchLoading...\r\n\r\nClose Yeah, keep it Undo CloseTHIS VIDEO IS UNAVAILABLE. [... raw scraped YouTube page text truncated for readability ...] Detect bad readings in real time using Python and Streaming Analytics. Detect Malfunctioning IoT Sensors with Streaming Analytics Live 0
1 No Free Hunch Navigation * kaggle.com\r\n\r\nCommunicating data science: A guide to presenting your work [... raw scraped Kaggle blog text truncated for readability ...] See the forest, see the trees. Here lies the challenge in both performing and presenting an analysis. As data scientists, analysts, and machine learning engineers faced with fulfilling business obj… Communicating data science: A guide to presenting your work Live 1

3. Use the cells below to find:

a. The number of unique articles that have an interaction with a user.
b. The number of unique articles in the dataset (whether they have any interactions or not).
c. The number of unique users in the dataset (excluding null values).
d. The number of user-article interactions in the dataset.

In [35]:
unique_articles = len(df['article_id'].unique())
print("The number of unique articles that have at least one interaction: {}".format(unique_articles))
total_articles = df_content.shape[0]
print("The number of unique articles on the IBM platform: {}".format(total_articles))
unique_users = len(df['email'].dropna().unique())
print("The number of unique users: {}".format(unique_users))
user_article_interactions = df.shape[0]
print("The number of user-article interactions: {}".format(user_article_interactions))
The number of unique articles that have at least one interaction: 714
The number of unique articles on the IBM platform: 1051
The number of unique users: 5148
The number of user-article interactions: 45993
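As a side note, `Series.nunique()` is an equivalent, more compact way to get these counts, and it skips NaN values by default, so no explicit `dropna()` is needed for the user count. A minimal sketch on toy data (hypothetical values, not the real dataset):

```python
import pandas as pd
import numpy as np

# toy interactions frame (hypothetical, for illustration only)
toy = pd.DataFrame({
    'article_id': [10, 10, 20, 30],
    'email': ['a@x.com', np.nan, 'a@x.com', 'b@x.com'],
})

# nunique() counts distinct values and excludes NaN by default
print(toy['article_id'].nunique())  # 3
print(toy['email'].nunique())       # 2
```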

4. Use the cells below to find the most viewed article_id, as well as how often it was viewed. After talking to the company leaders, the email_mapper function was deemed a reasonable way to map users to ids. There were a small number of null values, and it was found that all of these null values likely belonged to a single user (which is how they are stored using the function below).

In [36]:
# series_grouped1 (interactions counted per article_id) was defined in an earlier cell
most_viewed_article_id = str(float(series_grouped1.idxmax()))
print("The most viewed article in the dataset, id as a string with one value following the decimal: {}".
      format(most_viewed_article_id))
max_views = max(series_grouped1.values)
print("The most viewed article in the dataset was viewed how many times? answer: {}".format(max_views))
The most viewed article in the dataset, id as a string with one value following the decimal: 1429.0
The most viewed article in the dataset was viewed how many times? answer: 937
In [37]:
# This will be helpful for later parts of the notebook
# Run this cell to map the user email to a user_id column and remove the email column

def email_mapper():
    '''
    Maps each email in df['email'] to an integer user id;
    all null emails map to the same id.
    
    Output:
        email_encoded - list of integer user ids, one per row of df
    '''
    coded_dict = dict()
    cter = 1
    email_encoded = []
    
    for val in df['email']:
        if val not in coded_dict:
            coded_dict[val] = cter
            cter+=1
        
        email_encoded.append(coded_dict[val])
    return email_encoded

email_encoded = email_mapper()
del df['email']
df['user_id'] = email_encoded

# show header
df.head()
Out[37]:
article_id title user_id
0 1430 using pixiedust for fast, flexible, and easier data analysis and experimentation 1
1 1314 healthcare python streaming application demo 2
2 1429 use deep learning for image classification 3
3 1338 ml optimization using cognitive assistant 4
4 1276 deploy your python model as a restful api 5
In [38]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45993 entries, 0 to 45992
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   article_id  45993 non-null  int64 
 1   title       45993 non-null  object
 2   user_id     45993 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 1.1+ MB
In [39]:
def get_ordered_article_usercounts(df=df):
    '''
    Creates a new dataframe of article ids and their user interaction counts, ordered descending.
    
    Input:
        df - modified dataframe with attributes 'article_id', 'title', 'user_id'
    Output:
        df - descending ordered dataframe with attributes 'article_id' and 'user_count'
    '''
    
    # group the modified df by article id and count the user interactions per article
    srs_grpd_dfmod = df.groupby(['article_id'])['user_id'].count()
    
    # create the new ordered dataframe
    index_list = list(range(0, len(srs_grpd_dfmod)))
    interaction_dict = {
        'article_id': list(srs_grpd_dfmod.index),
        'user_count': list(srs_grpd_dfmod.values),
    }
    df_mod_ordered = pd.DataFrame(data=interaction_dict,
                          columns=['article_id', 'user_count'], index=index_list)
    df_mod_ordered = df_mod_ordered.nlargest(len(srs_grpd_dfmod), 'user_count')
    
    
    return df_mod_ordered
In [40]:
df_ordered = get_ordered_article_usercounts(df=df)
df_ordered.head(10)
Out[40]:
article_id user_count
699 1429 937
625 1330 927
701 1431 671
697 1427 643
652 1364 627
614 1314 614
600 1293 572
526 1170 565
518 1162 512
608 1304 483
In [41]:
# get the first 5 article ids
df_ordered['article_id'][0:5].to_list()
Out[41]:
[1429, 1330, 1431, 1427, 1364]
In [42]:
df.query('article_id == 1429')
Out[42]:
article_id title user_id
2 1429 use deep learning for image classification 3
6 1429 use deep learning for image classification 7
41 1429 use deep learning for image classification 3
75 1429 use deep learning for image classification 7
80 1429 use deep learning for image classification 40
... ... ... ...
45147 1429 use deep learning for image classification 3968
45153 1429 use deep learning for image classification 3968
45156 1429 use deep learning for image classification 3968
45190 1429 use deep learning for image classification 5077
45741 1429 use deep learning for image classification 5138

937 rows × 3 columns

In [43]:
## If you stored all your results in the variable names above, 
## you shouldn't need to change anything in this cell

sol_1_dict = {
    '`50% of individuals have _____ or fewer interactions.`': median_val,
    '`The total number of user-article interactions in the dataset is ______.`': user_article_interactions,
    '`The maximum number of user-article interactions by any 1 user is ______.`': max_views_by_user,
    '`The most viewed article in the dataset was viewed _____ times.`': max_views,
    '`The article_id of the most viewed article is ______.`': most_viewed_article_id,
    '`The number of unique articles that have at least 1 rating ______.`': unique_articles,
    '`The number of unique users in the dataset is ______`': unique_users,
    '`The number of unique articles on the IBM platform`': total_articles,
}

# Test your dictionary against the solution
t.sol_1_test(sol_1_dict)
It looks like you have everything right here! Nice job!

Part II: Rank-Based Recommendations

Unlike in the earlier lessons, we don't actually have ratings for whether a user liked an article or not. We only know that a user has interacted with an article. In these cases, the popularity of an article can really only be based on how often an article was interacted with.
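The idea can be sketched in a few lines of pandas. This is a minimal toy example (the data here is made up, not the notebook's dataset): `value_counts` already returns interaction counts per article in descending order, so the top-n ids are simply the first n index entries.

```python
import pandas as pd

# toy interactions (hypothetical data, not the notebook's dataset)
toy = pd.DataFrame({
    'user_id':    [1, 2, 1, 3, 2],
    'article_id': [10, 10, 20, 10, 20],
})

# popularity = number of interactions per article, most popular first
popularity = toy['article_id'].value_counts()
top_2 = popularity.index[:2].tolist()
print(top_2)  # [10, 20]
```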

1. Fill in the function below to return the n top articles ordered with most interactions as the top. Test your function using the tests below.

In [44]:
def get_top_articles(n, df=df):
    '''
    Returns the top n article titles.
    
    Input:
        n - (int) the number of top articles to return
        df - (pandas dataframe) df as defined at the top of the notebook 
    
    Output:
        top_articles - (list) A list of the top 'n' article titles    
    '''
    top_articles = []
    
    # get the n article ids
    n_art_ids = get_top_article_ids(n, df)
    
    # list of article titles: take the title of the first row matching each article id
    for a_id in n_art_ids:
        title = df.loc[df['article_id'] == float(a_id), 'title'].iloc[0]
        top_articles.append(title)
    
    return top_articles # Return the top article titles from df (not df_content)


def get_top_article_ids(n, df=df):
    '''
    Returns the top n article id numbers.
    
    Input:
        n - (int) the number of top articles to return
        df - (pandas dataframe) df as defined at the top of the notebook 
    
    Output:
        top_articles_ids - (list) A list of the top 'n' article id's     
    '''
    
    df_ordered = get_ordered_article_usercounts(df=df)
    top_articles_ids = df_ordered['article_id'][0:n].to_list()
 
    return top_articles_ids # Return the top article ids
In [45]:
print("\033[1mGet Top Article Titles List:\033[0m")
print(get_top_articles(10))
print("\n\033[1mGet Top Article IDs List:\033[0m")
print(get_top_article_ids(10))
Get Top Article Titles List:
['use deep learning for image classification', 'insights from new york car accident reports', 'visualize car data with brunel', 'use xgboost, scikit-learn & ibm watson machine learning apis', 'predicting churn with the spss random tree algorithm', 'healthcare python streaming application demo', 'finding optimal locations of new store using decision optimization', 'apache spark lab, part 1: basic concepts', 'analyze energy consumption in buildings', 'gosales transactions for logistic regression model']

Get Top Article IDs List:
[1429, 1330, 1431, 1427, 1364, 1314, 1293, 1170, 1162, 1304]
In [46]:
# Test your function by returning the top 5, 10, and 20 articles
top_5 = get_top_articles(5)
top_10 = get_top_articles(10)
top_20 = get_top_articles(20)

# Test each of your three lists from above
t.sol_2_test(get_top_articles)
Your top_5 looks like the solution list! Nice job.
Your top_10 looks like the solution list! Nice job.
Your top_20 looks like the solution list! Nice job.

Part III: User-User Based Collaborative Filtering

1. Use the function below to reformat the df dataframe to be shaped with users as the rows and articles as the columns.

  • Each user should only appear in each row once.
  • Each article should only show up in one column.
  • If a user has interacted with an article, then place a 1 where the user-row meets for that article-column. It does not matter how many times a user has interacted with the article, all entries where a user has interacted with an article should be a 1.
  • If a user has not interacted with an item, then place a zero where the user-row meets for that article-column.

Use the tests to make sure the basic structure of the matrix matches what is expected by the solution.
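The transformation described by the bullets can be sketched on toy data (hypothetical ids, not the notebook's dataset): count interactions per (user, article) pair, pivot articles into columns with `unstack`, then collapse "any interaction" to 1 and "no interaction" to 0.

```python
import pandas as pd

# toy interactions (hypothetical); user 1 viewed article 10 twice
toy = pd.DataFrame({'user_id': [1, 1, 2], 'article_id': [10, 10, 20]})

# count interactions per (user, article), pivot articles into columns,
# then map "any interaction" to 1 and "no interaction" to 0
mat = (toy.groupby(['user_id', 'article_id'])['article_id']
          .count()
          .unstack()
          .notnull()
          .astype(int))
print(mat.loc[1, 10], mat.loc[1, 20], mat.loc[2, 20])  # 1 0 1
```

Note that the repeated view by user 1 still becomes a single 1, matching the requirement that the interaction count does not matter.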

In [47]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45993 entries, 0 to 45992
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   article_id  45993 non-null  int64 
 1   title       45993 non-null  object
 2   user_id     45993 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 1.1+ MB
In [48]:
df.head()
Out[48]:
article_id title user_id
0 1430 using pixiedust for fast, flexible, and easier data analysis and experimentation 1
1 1314 healthcare python streaming application demo 2
2 1429 use deep learning for image classification 3
3 1338 ml optimization using cognitive assistant 4
4 1276 deploy your python model as a restful api 5
In [49]:
df.groupby(['user_id', 'article_id'])['article_id'].count()
Out[49]:
user_id  article_id
1        43            1
         109           1
         151           1
         268           1
         310           2
                      ..
5146     1394          1
         1416          1
5147     233           1
5148     1160          1
5149     16            1
Name: article_id, Length: 33682, dtype: int64
In [50]:
user_item = df.groupby(['user_id', 'article_id'])['article_id'].count().unstack()
user_item
Out[50]:
article_id 0 2 4 8 9 12 14 15 16 18 ... 1434 1435 1436 1437 1439 1440 1441 1442 1443 1444
user_id
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN 1.0 NaN 1.0 NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN 1.0 NaN NaN NaN NaN ... NaN NaN 1.0 NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
5145 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5146 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5147 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5148 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5149 NaN NaN NaN NaN NaN NaN NaN NaN 1.0 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5149 rows × 714 columns

In [51]:
user_item.shape
Out[51]:
(5149, 714)
In [52]:
type(user_item)
Out[52]:
pandas.core.frame.DataFrame
In [53]:
user_item.iloc[0,0]
Out[53]:
nan
In [54]:
np.isnan(user_item.iloc[0,0])
Out[54]:
True
In [55]:
user_item.iloc[0, 0] = 0
user_item
Out[55]:
article_id 0 2 4 8 9 12 14 15 16 18 ... 1434 1435 1436 1437 1439 1440 1441 1442 1443 1444
user_id
1 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN 1.0 NaN 1.0 NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN 1.0 NaN NaN NaN NaN ... NaN NaN 1.0 NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
5145 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5146 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5147 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5148 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5149 NaN NaN NaN NaN NaN NaN NaN NaN 1.0 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5149 rows × 714 columns

In [56]:
type(user_item.iloc[0, 0])
Out[56]:
numpy.float64
In [57]:
user_item.columns[713]
Out[57]:
1444
In [58]:
user_item.shape[1]-1
Out[58]:
713
In [59]:
# should be the value at user_id=3, article_id=12, which is 1.0, so 1.0 > 0 evaluates to True
print("article_id: {}".format(user_item.columns[5]))
user_item.iloc[2, 5] > 0
article_id: 12
Out[59]:
True
In [60]:
def create_user_item_matrix(df):
    '''
    Creates the user-article matrix with 1's and 0's.
    
    Input:
        df - pandas dataframe with article_id, title, user_id columns
    
    Output:
        user_item - (dataframe) user item matrix 
    
    Description:
    Return a matrix with user id's as rows and article id's on the columns with 1 values where
    a user interacted with an article and a 0 otherwise.
    '''
    
    # Create user-by-item matrix: interaction counts, NaN where there was no interaction
    user_item = df.groupby(['user_id', 'article_id'])['article_id'].count().unstack()
    
    # vectorized replacement for an elementwise loop:
    # any non-null count (>= 1 interaction) becomes 1, NaN becomes 0
    user_item = user_item.notnull().astype('int64')
    
    return user_item
In [61]:
user_item = create_user_item_matrix(df)
user_item
Out[61]:
article_id 0 2 4 8 9 12 14 15 16 18 ... 1434 1435 1436 1437 1439 1440 1441 1442 1443 1444
user_id
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 1 0 1 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 1 0 0 0 0 ... 0 0 1 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
5 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
5145 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
5146 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
5147 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
5148 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
5149 0 0 0 0 0 0 0 0 1 0 ... 0 0 0 0 0 0 0 0 0 0

5149 rows × 714 columns

In [62]:
## Tests: You should just need to run this cell.  Don't change the code.
assert user_item.shape[0] == 5149, "Oops!  The number of users in the user-article matrix doesn't look right."
assert user_item.shape[1] == 714, "Oops!  The number of articles in the user-article matrix doesn't look right."
assert user_item.sum(axis=1)[1] == 36, "Oops!  The number of articles seen by user 1 doesn't look right."
print("You have passed our quick tests!  Please proceed!")
You have passed our quick tests!  Please proceed!

2. Now, we want to find similar users. The function below takes a user_id and provides an ordered list of the most similar users to that user (from most similar to least similar). The returned result does not contain the provided user_id, since every user is trivially most similar to him/herself.

To measure user similarity we have no rating information at this point, so we can only rely on the user-article interactions. Because each user's row in the matrix is binary, it makes sense to compute similarity as the dot product of two users' rows.
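Why the dot product works here can be seen on two toy binary rows (hypothetical values): multiplying 0/1 entries and summing counts exactly the positions where both vectors have a 1, i.e. the number of articles both users interacted with.

```python
import numpy as np

# two users' binary interaction rows (hypothetical)
u = np.array([1, 0, 1, 1, 0])
v = np.array([1, 1, 1, 0, 0])

# dot product of 0/1 vectors = number of articles both users interacted with
shared = int(u.dot(v))
print(shared)  # 2
```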

In [63]:
def find_similar_users(user_id, user_item=user_item):
    '''
    Searches similar users to the given one.
    
    Input:
        user_id - (int) a user_id
        user_item - (pandas dataframe) matrix of users by articles: 
                    1's when a user has interacted with an article, 0 otherwise
    
    Output:
        similar_users - (list) an ordered list where the closest users (largest dot product users)
                        are listed first
    
    Description:
    Computes the similarity of every pair of users based on the dot product and 
    returns an ordered list. 
    If the user_id does not exist in the user-item matrix, the return list is empty.   
    '''
    
    # regarding assert tests: if a (float) string value is given as user_id, create an int out of it
    user_id = int(float(user_id))
    
    most_similar_users = []
    
    if user_id in user_item.index:
        # compute similarity of each user to the provided user,
        # don't use np.dot() because we want to sort the values afterwards which is easier with pandas
        similarity = user_item.dot(user_item.loc[user_id])

        # sort by similarity
        similarity = similarity.sort_values(ascending=False)

        # remove the own user's id
        similarity.drop(user_id, inplace=True)

        # create list of just the ids
        most_similar_users = similarity.index.to_list()   
       
    # return a list of the users in order from most to least similar
    return most_similar_users      
In [64]:
# Do a spot check of your function
print("The 10 most similar users to user 1 are: {}".format(find_similar_users(1)[:10]))
print("The 5 most similar users to user 3933 are: {}".format(find_similar_users(3933)[:5]))
print("The 3 most similar users to user 46 are: {}".format(find_similar_users(46)[:3]))
The 10 most similar users to user 1 are: [3933, 23, 3782, 203, 4459, 131, 3870, 46, 4201, 5041]
The 5 most similar users to user 3933 are: [1, 23, 3782, 4459, 203]
The 3 most similar users to user 46 are: [4201, 23, 3782]

3. Now that we have a function that provides the most similar users to each user, we want to use these users to find articles we would recommend to each user.
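The core step used below is a set difference: from a similar user's articles, keep only those the target user has not seen. A minimal sketch with hypothetical article ids, using `np.setdiff1d` as the notebook's functions do:

```python
import numpy as np

# articles already seen by the target user vs. a similar user (hypothetical ids)
seen = [10, 20, 30]
neighbor_seen = [20, 30, 40, 50]

# articles the neighbor saw that the target user has not (sorted, unique)
new_recs = np.setdiff1d(neighbor_seen, seen).tolist()
print(new_recs)  # [40, 50]
```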

In [65]:
def get_article_names(article_ids, df=df):
    '''
    Returns the article titles of the given article id's.
    
    Input:
        article_ids - (list) a list of article id's
        df - (pandas dataframe) df as defined at the top of the notebook
    
    Output:
        article_names - (list) a list of article names associated with the list of article id's 
                        (this is identified by the title column); 
                        the list is empty if the article id's don't exist.
    '''
    article_names = []
       
    # take the title of the first row that matches each article id;
    # float() lets ids arrive as ints, floats, or strings like '1024.0'
    for a_id in article_ids:
        matches = df.loc[df['article_id'] == float(a_id), 'title']
        if len(matches) > 0:
            article_names.append(matches.iloc[0])
    
    # Return the article names associated with list of article ids
    return article_names  
In [66]:
# 2 examples from below assert part
get_article_names([1024.0, 1176.0])
Out[66]:
['using deep learning to reconstruct high-resolution audio',
 'build a python app on the streaming analytics service']
In [67]:
user = user_item.query("user_id == {}".format(1))
user
Out[67]:
article_id 0 2 4 8 9 12 14 15 16 18 ... 1434 1435 1436 1437 1439 1440 1441 1442 1443 1444
user_id
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 1 0 1 0 0 0 0 0

1 rows × 714 columns

In [68]:
series_user_items = user_item.loc[1]
series_user_items
Out[68]:
article_id
0       0
2       0
4       0
8       0
9       0
       ..
1440    0
1441    0
1442    0
1443    0
1444    0
Name: 1, Length: 714, dtype: int64
In [69]:
# index positions of values being 1
np.where(series_user_items.isin([1]))
Out[69]:
(array([ 22,  54,  76, 123, 138, 147, 152, 178, 222, 235, 253, 267, 285,
        315, 329, 374, 401, 408, 438, 526, 538, 540, 566, 600, 609, 651,
        656, 665, 672, 678, 697, 699, 700, 701, 706, 708], dtype=int64),)
In [70]:
# get series part with values being 1, having the expected index
series_user_items[series_user_items.isin([1])]
Out[70]:
article_id
43      1
109     1
151     1
268     1
310     1
329     1
346     1
390     1
494     1
525     1
585     1
626     1
668     1
732     1
768     1
910     1
968     1
981     1
1052    1
1170    1
1183    1
1185    1
1232    1
1293    1
1305    1
1363    1
1368    1
1391    1
1400    1
1406    1
1427    1
1429    1
1430    1
1431    1
1436    1
1439    1
Name: 1, dtype: int64
In [71]:
def get_user_articles(user_id, user_item=user_item):
    '''
    Returns the list of article id's and its associated titles for a given user.
    
    Input:
        user_id - (int) a user id
        user_item - (pandas dataframe) matrix of users by articles: 
                    1's when a user has interacted with an article, 0 otherwise
    
    Output:
        article_ids - (list) a list of the article ids seen by the user
        article_names - (list) a list of article names associated with the list of article id's 
                        (this is identified by the doc_full_name column in df_content)
    
    Description:
    Provides a list of the article_ids and article titles that have been seen by a user.
    The lists are empty if the given user id does not exist in the user-item matrix.
    '''
    # regarding assert tests: if a (float) string value is given as user_id, create an int out of it
    user_id = int(float(user_id))
    
    article_ids = []
    article_names = []
    
    if user_id in user_item.index:
        series_user_items = user_item.loc[user_id]
        article_ids = series_user_items[series_user_items.isin([1])].index.to_list()
        article_names = get_article_names(article_ids)
    
    # return the ids and names
    #print(type(article_ids[0]), type(article_names[0])) - throws the output: <class 'int'> <class 'str'>
    return article_ids, article_names 
In [72]:
get_user_articles(20)
Out[72]:
([232, 844, 1320],
 ['self-service data preparation with ibm data refinery',
  'use the cloudant-spark connector in python notebook',
  'housing (2015): united states demographic measures'])
In [73]:
def user_user_recs(user_id, m=10):
    '''
    Returns a list of 'm' article id's recommended to the given user which the user has not seen before.
    
    Input:
        user_id - (int) a user id
        m - (int) the number of recommendations you want for the user
    
    Output:
        recs - (list) a list of recommendations for the user
    
    Description:
    Loops through the users based on closeness to the input user_id.
    For each user - finds articles the user hasn't seen before and provides them as recs.
    Does this until m recommendations are found.
    
    Notes:
    Users with the same closeness are chosen arbitrarily as the 'next' user.
    
    If adding a user's unseen articles pushes the number of recommendations
    past m, the surplus items are cut off arbitrarily.
    
    If the given user id does not exist in the user-item matrix,
    the returned recs list is empty.
    '''
    
    # regarding assert tests: if a (float) string value is given as user_id, create an int out of it
    user_id = int(float(user_id))
    
    recs = []
    similar_users = find_similar_users(user_id)
    user_articles_ids, user_articles_names = get_user_articles(user_id)
    
    for similar_user in similar_users:
        similar_user_articles_ids, _ = get_user_articles(similar_user)
        if len(recs) < m:
            # subtract is not possible because the arrays have different length
            # movie_ids_diffs = np.subtract(user_articles_ids, similar_user_articles_ids)
            # so see: https://www.w3resource.com/python-exercises/numpy/python-numpy-exercise-20.php
            article_ids_diffs = np.setdiff1d(similar_user_articles_ids, user_articles_ids)
            # see: https://docs.scipy.org/doc/numpy/reference/generated/numpy.unique.html#numpy.unique
            recs = np.unique(np.concatenate([recs, article_ids_diffs], axis=0))
    
    if len(recs) > m:
        recs = recs[0:m]       
    
    # return your recommendations for this user_id 
    return recs   
In [74]:
# Check Results
# Return 10 recommendations for user 1
get_article_names(user_user_recs(1, 10))
Out[74]:
['this week in data science (april 18, 2017)',
 'timeseries data analysis of iot events by using jupyter notebook',
 'got zip code data? prep it for analytics. – ibm watson data lab – medium',
 'higher-order logistic regression for large datasets',
 'using machine learning to predict parking difficulty',
 'deep forest: towards an alternative to deep neural networks',
 'experience iot with coursera',
 'using brunel in ipython/jupyter notebooks',
 'graph-based machine learning',
 'the 3 kinds of context: machine learning and the art of the frame']
In [75]:
# Test your functions here - No need to change this code - just run this cell
#  Note:
#  In this part Udacity wrote the ids as strings for the set() function,
#  which makes the asserts fail (AssertionError).
#  So I changed these id strings to float numbers in this assert part!
#  get_user_articles() returns the same result in a different order; its id-title mapping is correct.
#  Therefore I tested the third assert statement with both the given number order and a changed order;
#  both gave the same (passing) assert result.
#
assert set(get_article_names(['1024.0', '1176.0', '1305.0', '1314.0', '1422.0', '1427.0'])) == set(['using deep learning to reconstruct high-resolution audio', 'build a python app on the streaming analytics service', 'gosales transactions for naive bayes model', 'healthcare python streaming application demo', 'use r dataframes & ibm watson natural language understanding', 'use xgboost, scikit-learn & ibm watson machine learning apis']), "Oops! Your get_article_names function doesn't work quite how we expect."
assert set(get_article_names(['1320.0', '232.0', '844.0'])) == set(['housing (2015): united states demographic measures','self-service data preparation with ibm data refinery','use the cloudant-spark connector in python notebook']), "Oops! Your get_article_names function doesn't work quite how we expect."
assert set(get_user_articles(20)[0]) == set([1320.0, 232.0, 844.0])  #set(['1320.0', '232.0', '844.0'])
assert set(get_user_articles(20)[1]) == set(['housing (2015): united states demographic measures', 'self-service data preparation with ibm data refinery','use the cloudant-spark connector in python notebook'])
assert set(get_user_articles(2)[0]) == set([1024.0, 1176.0, 1305.0, 1314.0, 1422.0, 1427.0])
assert set(get_user_articles(2)[1]) == set(['using deep learning to reconstruct high-resolution audio', 'build a python app on the streaming analytics service', 'gosales transactions for naive bayes model', 'healthcare python streaming application demo', 'use r dataframes & ibm watson natural language understanding', 'use xgboost, scikit-learn & ibm watson machine learning apis'])
print("If this is all you see, you passed all of our tests!  Nice job!")
If this is all you see, you passed all of our tests!  Nice job!

4. Now we are going to improve the consistency of the user_user_recs function from above.

  • Instead of arbitrarily choosing when we obtain users who are all the same closeness to a given user - choose the users that have the most total article interactions before choosing those with fewer article interactions.
  • Instead of arbitrarily choosing articles from the user where the number of recommended articles starts below m and ends exceeding m, choose the articles with the most total interactions before choosing those with fewer total interactions. This ranking should match what the top_articles function we wrote earlier returns.
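The first bullet amounts to a two-key sort: primary key similarity, secondary key total interactions, both descending. A minimal sketch with hypothetical neighbor stats shows how `sort_values` breaks the similarity tie:

```python
import pandas as pd

# hypothetical neighbor stats (not computed from the notebook's data)
neighbors = pd.DataFrame({
    'neighbor_id':      [5, 7, 9],
    'similarity':       [3, 3, 5],
    'num_interactions': [10, 40, 2],
})

# break similarity ties by total interactions, highest first
ordered = neighbors.sort_values(by=['similarity', 'num_interactions'],
                                ascending=False)
print(ordered['neighbor_id'].tolist())  # [9, 7, 5]
```

User 9 wins on similarity alone; users 7 and 5 are tied on similarity, so 7 comes first due to its higher interaction count.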
In [76]:
df['user_id'].value_counts()
Out[76]:
23      364
3782    363
98      170
3764    169
203     160
       ... 
1039      1
3150      1
1103      1
3182      1
2049      1
Name: user_id, Length: 5149, dtype: int64
In [77]:
type(df['user_id'].value_counts())
Out[77]:
pandas.core.series.Series
In [78]:
(df['user_id'].value_counts()).drop(labels=[23])
Out[78]:
3782    363
98      170
3764    169
203     160
4459    158
       ... 
1039      1
3150      1
1103      1
3182      1
2049      1
Name: user_id, Length: 5148, dtype: int64
In [79]:
def get_top_sorted_users(user_id, df=df, user_item=user_item):
    '''
    Returns a dataframe with ordered similar users to the given user id and
    their similarity and article interaction count information.
    
    Input:
        user_id - (int) a user id
        df - (pandas dataframe) df as defined at the top of the notebook 
        user_item - (pandas dataframe) matrix of users by articles: 
                1's when a user has interacted with an article, 0 otherwise
    
            
    Output:
        neighbors_df - (pandas dataframe) a dataframe with:
                        neighbor_id - is a neighbor user_id
                        similarity - measure of the similarity of each user to the provided user_id
                        num_interactions - the number of articles viewed by the user;
                        if the user_id does not exist the output dataframe includes the column labels only.
                    
    Other Details - sort the neighbors_df by the similarity and then by number of interactions where 
                    highest of each is higher in the dataframe   
    '''
    
    # regarding assert tests: if a (float) string value is given as user_id, create an int out of it
    user_id = int(float(user_id))
    
    # for the resulting neighbors_df we have to calculate the specific column parts separately
    # and cannot use the find_similar_users() function
    neighbors_df = pd.DataFrame(columns=['neighbor_id', 'similarity', 'num_interactions']) 
    
    if user_id in user_item.index:
        # compute similarity of each user to the provided user
        similarity = user_item.dot(user_item.loc[user_id])  
        # sort by similarity, remove the own user's id
        similarity = similarity.sort_values(ascending=False).drop(user_id)

        # compute the interaction article numbers of the given user,
        # remove the own user's id from the series object instance
        num_interaction = (df['user_id'].value_counts()).drop(labels=user_id)

        df_neighbors_dict = {
            'similarity': similarity,
            'num_interactions': num_interaction,
        }

        neighbors_df = pd.DataFrame(df_neighbors_dict)
        neighbors_df = neighbors_df.sort_values(by=['similarity', 'num_interactions'], ascending=False)
        neighbors_df = neighbors_df.reset_index()
        neighbors_df = neighbors_df.rename(columns={'index': 'neighbor_id'})
        
    
    # Return the dataframe specified in the doc_string
    return neighbors_df
In [80]:
# example for user id 1  (used for testing below)
example_user_1 = get_top_sorted_users(1)
example_user_1
Out[80]:
neighbor_id similarity num_interactions
0 3933 35 45
1 23 17 364
2 3782 17 363
3 203 15 160
4 4459 15 158
... ... ... ...
5143 5141 0 1
5144 5144 0 1
5145 5147 0 1
5146 5148 0 1
5147 5149 0 1

5148 rows × 3 columns

In [81]:
# example 2
example_user_notExist = get_top_sorted_users(6000)
print("dataframe size for empty df shall be zero and it is: {}\nDataframe structure is:\n".
      format(example_user_notExist.size))
example_user_notExist
dataframe size for empty df shall be zero and it is: 0
Dataframe structure is:

Out[81]:
neighbor_id similarity num_interactions
In [82]:
def user_user_recs_part2(user_id, m=10):
    '''
    Returns an ordered list of 'm' article id's recommended to the given user
    which the user has not seen before.
    
    Input:
        user_id - (int) a user id
        m - (int) the number of recommendations you want for the user
    
    Output:
        recs - (list) a list of recommendations for the user by article id's
        rec_names - (list) a list of recommendations for the user by article title
    
    Description:
    Loops through the users based on closeness to the input user_id.
    For each user - finds articles the user hasn't seen before and provides them as recs.
    Does this until m recommendations are found.
    
    Notes:
    * Choose the users that have the most total article interactions 
    before choosing those with fewer article interactions.

    * Choose the articles with the most total interactions 
    before choosing those with fewer total interactions. 
    '''
    
    # regarding assert tests: if a (float) string value is given as user_id, create an int out of it
    user_id = int(float(user_id))
    
    recs = []    
    user_articles_ids, user_articles_names = get_user_articles(user_id)
    
    if len(user_articles_ids) == 0:  # user id does not exist in user-item matrix
        recs = get_top_article_ids(m)
        rec_names = get_article_names(recs)
    else:  # user id is part of the user-item matrix
        df_neighbours = get_top_sorted_users(user_id)
        neighbours_ids = list(df_neighbours['neighbor_id'].values)
        #print('neighbours:\n{}'.format(neighbours_ids))
        
        for neigh_id in neighbours_ids: #[:4]
            neigh_id_articles_ids, _ = get_user_articles(neigh_id)
            if len(recs) < m:
                # see: https://www.w3resource.com/python-exercises/numpy/python-numpy-exercise-20.php
                article_ids_diffs = np.setdiff1d(neigh_id_articles_ids, user_articles_ids)
                #print('diffs ids: {}'.format(article_ids_diffs))
                # see: https://docs.scipy.org/doc/numpy/reference/generated/numpy.unique.html#numpy.unique
                recs = np.unique(np.concatenate([recs, article_ids_diffs], axis=0))
    
    #print('len of recs: {}'.format(len(recs)))
    #print(recs)
    
    if len(recs) > m:
        recs = recs[0:m]  
        
    rec_names = get_article_names(recs)
    
    return recs, rec_names
In [83]:
# Quick spot check
rec_ids, rec_names = user_user_recs_part2(20, 10)
print("\033[1mThe top 10 recommendations for user 20 are the following article ids:\033[0m")
print(rec_ids)
print()
print("\033[1mThe top 10 recommendations for user 20 are the following article names:\033[0m")
print(rec_names)
The top 10 recommendations for user 20 are the following article ids:
[ 12. 109. 125. 142. 164. 205. 302. 336. 362. 465.]

The top 10 recommendations for user 20 are the following article names:
['timeseries data analysis of iot events by using jupyter notebook', 'tensorflow quick tips', 'statistics for hackers', 'neural networks for beginners: popular types and applications', 'learn tensorflow and deep learning together and now!', "a beginner's guide to variational methods", 'accelerate your workflow with dsx', 'challenges in deep learning', 'dsx: hybrid mode', 'introduction to neural networks, advantages and applications']

5. We use the functions from above to fill in the solutions to the dictionary below and test the dictionary against the solution.

In [84]:
get_top_sorted_users(131)['neighbor_id'].head(10).tail(1).values[0]  # value shall be 242
Out[84]:
242
In [85]:
### Tests with a dictionary of results

# Find the user that is most similar to user 1
user1_most_sim = example_user_1.head(1)['neighbor_id'].values[0]
print(user1_most_sim, type(user1_most_sim))

# Find the 10th most similar user to user 131
user131_10th_sim = get_top_sorted_users(131)['neighbor_id'].head(10).tail(1).values[0] 
print(user131_10th_sim, type(user131_10th_sim))
3933 <class 'numpy.int64'>
242 <class 'numpy.int64'>
In [86]:
## Dictionary Test Here
sol_5_dict = {
    'The user that is most similar to user 1.': user1_most_sim, 
    'The user that is the 10th most similar to user 131': user131_10th_sim,
    }

t.sol_5_test(sol_5_dict)
This all looks good!  Nice job!

6. If a new user is given, which of the above functions would we be able to use to make recommendations? Is there a better way we might make recommendations?

Answer:
This project implementation does not include explicit ratings of the article items by users; only user-article interaction information is given. So no predicted rating can be used to recommend articles to a new user. Additionally, a new user has no article interactions of their own, so no user-user similarity recommendation is possible either. This shortcoming is called the cold-start problem; the analogous case would be a newly added article item.

Therefore, the only recommendation we can propose to a new user is the set of most-interacted-with articles from the whole dataset. This can still be classified as a kind of collaborative filtering, because we rely on the collective behaviour of the users: if many users interacted with an article, that article can be recommended to the new user who has not seen it yet. We should keep in mind, though, that this rests on an implicit assumption that greatly simplifies the situation.

Such an interaction-ranked recommendation for a new user carries the higher risk that it may not fit the new user's interests, taste, or reading goals. In other words, the delivered recommendation could be completely wrong for this particular user.

Furthermore, the associated user-item matrix is in general extremely sparse, because users do not rate or interact with all available items (articles); the matrix has a high sparsity value (sparsity = number of empty cells / number of all cells). In this project we deal with user-article interactions and their boolean result (0 = no interaction, 1 = interaction), so for us 'empty' means the no-interaction value 0; in real rating matrices it would be NaN (a null value). This matrix structure can worsen the risk of a wrong recommendation for the new user: we cannot be sure that a high interaction count means the interacting users actually liked the article. The real popularity of an article remains unclear.

One possible improvement would be to add a content-based recommendation concept built on the additionally delivered df_content file, which provides document information about the articles. This task could be implemented with NLP (natural language processing) techniques. The resulting knowledge about the articles would improve the picture of their popularity considerably, so even a very simple ranking recommendation would fit a new user much better than the existing one.
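As an illustration of that content-based idea, here is a minimal sketch with toy documents and plain numpy (the texts are hypothetical stand-ins, not the actual df_content columns): it builds TF-IDF profiles for articles and compares them by cosine similarity.

```python
import numpy as np

# hypothetical toy descriptions standing in for article texts from df_content
docs = ["deep learning with tensorflow",
        "intro to deep neural networks",
        "pandas dataframe basics"]

vocab = sorted({w for d in docs for w in d.split()})
# term-frequency matrix: one row per document, one column per vocabulary word
tf = np.array([[d.split().count(w) for w in vocab] for d in docs], dtype=float)
# inverse document frequency: rarer words weigh more
idf = np.log(len(docs) / (tf > 0).sum(axis=0))
tfidf = tf * idf

# cosine similarity between article profiles
unit = tfidf / np.linalg.norm(tfidf, axis=1, keepdims=True)
sim = unit @ unit.T
```

Articles 0 and 1 share the term 'deep', so their similarity is positive, while article 2 has no word overlap with article 0; the top-ranked articles for a new user could then be filtered or diversified by such content profiles.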

Nevertheless, this content-based concept still does not account for the fact that user tastes change over time. In other words, a recommendation given now could look very different some time in the future, even with content ratings of the same articles by the same users. The same holds for user-user similarity recommendation systems. A better approach for similarity recommendations is therefore item-item similarity, which relies on the articles' content, and that content is not expected to change significantly over a long time frame.
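The item-item idea can be sketched under the same binary-interaction assumption (a toy matrix, not the real user_item_matrix): item similarities follow directly from co-occurrence counts.

```python
import numpy as np

# toy binary user-item matrix: rows = users, columns = articles
ui = np.array([[1, 1, 0, 0],
               [1, 1, 1, 0],
               [0, 0, 1, 1]], dtype=float)

# co[i, j] = number of users who interacted with both article i and article j
co = ui.T @ ui
norms = np.sqrt(np.diag(co))
item_sim = co / np.outer(norms, norms)   # cosine similarity between article columns

# most similar article to article 0, excluding itself
masked = item_sim[0].copy()
masked[0] = -np.inf
best = int(np.argmax(masked))
```

Here articles 0 and 1 were always read together, so article 1 comes out as the closest neighbour of article 0, while article 3 shares no readers with it.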

Another kind of recommendation system exists as an improvement, and we will implement it below in a simple form: matrix factorisation. There, a gradient descent procedure fits the observed interaction values, so the system can approximate the non-interaction ('empty') matrix cells as well. This optimisation minimises a squared loss, and afterwards we no longer have to work with a sparse user-item matrix.
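That gradient-descent idea can be sketched as a FunkSVD-style loop; this is a toy implementation on synthetic data (`funk_svd_sketch` is a hypothetical helper, not part of this project), where missing cells are marked with NaN and only the known cells drive the updates.

```python
import numpy as np

def funk_svd_sketch(mat, k=2, lr=0.02, iters=800, seed=0):
    '''Factor mat ~ U @ Vt by SGD on the known (non-NaN) cells only.'''
    rng = np.random.default_rng(seed)
    n_users, n_items = mat.shape
    U = rng.normal(scale=0.1, size=(n_users, k))
    Vt = rng.normal(scale=0.1, size=(k, n_items))
    known = np.argwhere(~np.isnan(mat))
    for _ in range(iters):
        for i, j in known:
            err = mat[i, j] - U[i] @ Vt[:, j]    # squared-loss residual
            u_old = U[i].copy()
            U[i] += lr * err * Vt[:, j]          # gradient step for the user factors
            Vt[:, j] += lr * err * u_old         # gradient step for the item factors
    return U, Vt

# toy matrix: rows 0 and 1 share the same pattern; one cell is unknown
mat = np.array([[1., 0., np.nan, 0.],
                [1., 0., 1.,     0.],
                [0., 1., 0.,     1.]])
U, Vt = funk_svd_sketch(mat)
pred = U @ Vt
```

The fit reproduces the known cells, and the unknown cell pred[0, 2] should be imputed close to 1, because user 0 behaves like user 1 on all observed cells.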

7. Using your existing functions, provide the top 10 recommended articles you would provide for a new user below. You can test your function against our thoughts to make sure we are all on the same page with how we might make a recommendation.

In [87]:
new_user = '0.0'

# What would your recommendations be for this new user '0.0'?  As a new user, they have no observed articles.
# Provide a list of the top 10 article ids you would give to them.
new_user_recs, new_user_recs_names = user_user_recs_part2(new_user, 10)
print(new_user_recs)
[1429, 1330, 1431, 1427, 1364, 1314, 1293, 1170, 1162, 1304]
In [88]:
# see my note about string id's above in the former assert block
assert set(new_user_recs) == set([1314.0, 1429.0, 1293.0, 1427.0, 1162.0, 1364.0, 1304.0, 1170.0, 1431.0, 1330.0]), "Oops!  It makes sense that in this case we would want to recommend the most popular articles, because we don't know anything about these users."
print("That's right!  Nice job!")
That's right!  Nice job!

Part IV: Matrix Factorization

In this part of the notebook, we will use matrix factorization to make article recommendations to the users on the IBM Watson Studio platform.

1. We have already created a user_item matrix above in question 1 of Part III. This first question here just requires running the cells to get things set up for the rest of Part IV of this notebook.

In [89]:
# Load the matrix here
user_item_matrix = pd.read_pickle('user_item_matrix.p')
In [90]:
# quick look at the matrix
user_item_matrix.head()
Out[90]:
article_id 0.0 100.0 1000.0 1004.0 1006.0 1008.0 101.0 1014.0 1015.0 1016.0 ... 977.0 98.0 981.0 984.0 985.0 986.0 990.0 993.0 996.0 997.0
user_id
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 714 columns

2. In this situation, we can use Singular Value Decomposition from numpy on the user-item matrix.

In [91]:
# Perform SVD on the User-Item Matrix here,
# use the built in to get the three matrices
u, s, vt = np.linalg.svd(user_item_matrix)
In [92]:
print("The u, s, vt matrices shapes are:")
print(u.shape, s.shape, vt.shape)
The u, s, vt matrices shapes are:
(5149, 5149) (714,) (714, 714)

Note:
We are working with a user-item matrix that has no missing values; it is filled with 0's and 1's for every user-article interaction. This is what allows us to use the built-in singular value decomposition function from numpy.
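A quick sketch of what the decomposition gives back (toy matrix, hypothetical values): with all singular values kept, U·S·Vt reconstructs the original matrix exactly.

```python
import numpy as np

# a small dense 0/1 matrix in the same spirit as user_item_matrix
a = np.array([[1., 0., 1.],
              [0., 1., 0.],
              [1., 1., 0.],
              [0., 0., 1.]])

u, s, vt = np.linalg.svd(a)                 # u: (4, 4), s: (3,), vt: (3, 3)
a_hat = u[:, :len(s)] @ np.diag(s) @ vt     # full-rank reconstruction
```

Truncating u, s, and vt to the first k components, as done in the cells below, then yields the best rank-k approximation instead of an exact reconstruction.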

3. Now for the tricky part, how do we choose the number of latent features to use? Running the below cell, we see that as the number of latent features increases, we obtain a lower error rate on making predictions for the 1 and 0 values in the user-item matrix.

In [93]:
num_latent_feats = np.arange(10,700+10,20)
sum_errs = []

for k in num_latent_feats:
    # restructure with k latent features
    s_new, u_new, vt_new = np.diag(s[:k]), u[:, :k], vt[:k, :]
    
    # take dot product
    user_item_est = np.around(np.dot(np.dot(u_new, s_new), vt_new))
    
    # compute error for each prediction to actual value
    diffs = np.subtract(user_item_matrix, user_item_est)
    
    # total errors and keep track of them
    err = np.sum(np.sum(np.abs(diffs)))
    sum_errs.append(err)

For this fit, we plot an accuracy diagram as known from deep learning projects, a so-called learning curve, which serves as a performance metric of the model. The same metric will later visualise the improvement of the next concept, which works on separate training and test data rather than on all data as here.

In [94]:
plt.plot(num_latent_feats, 1 - np.array(sum_errs)/df.shape[0], label='all data')
plt.xlabel('no. of latent features')
plt.ylabel('accuracy')
plt.title('Accuracy vs. Number of Latent Features')
plt.legend(loc='best');

4. From the above, we can't really be sure how many features to use, because simply having a better way to predict the 1's and 0's of the matrix doesn't give us an indication of whether we are able to make good recommendations. Instead, we might split our dataset into a training and test set of data, as shown in the cell below.

We use the code from question 3 to understand the impact of different numbers of latent features on the accuracy for the training and test sets of data. Using the split below:

  • How many users can we make predictions for in the test set?
  • How many users are we not able to make predictions for because of the cold start problem?
  • How many articles can we make predictions for in the test set?
  • How many articles are we not able to make predictions for because of the cold start problem?
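These four counts come down to plain set operations on the ids. A minimal sketch with hypothetical toy dataframes (the real answers of course use df_train / df_test from below):

```python
import pandas as pd

# toy interaction logs standing in for df_train / df_test
df_tr = pd.DataFrame({'user_id': [1, 1, 2, 3], 'article_id': [10, 20, 10, 30]})
df_te = pd.DataFrame({'user_id': [3, 4], 'article_id': [10, 20]})

train_users, test_users = set(df_tr.user_id), set(df_te.user_id)
train_arts, test_arts = set(df_tr.article_id), set(df_te.article_id)

predictable_users = test_users & train_users   # users SVD can score
cold_start_users = test_users - train_users    # unseen users: cold start
predictable_arts = test_arts & train_arts      # articles SVD can score
cold_start_arts = test_arts - train_arts       # unseen articles: cold start
```

Only ids that also occur in the training data can receive SVD-based predictions; everything else falls under the cold-start problem discussed above.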
In [95]:
df_train = df.head(40000)
df_test = df.tail(5993)
In [96]:
def create_test_and_train_user_item(df_train, df_test):
    '''
    Creates the training and testing user-item matrices from the given dataframes and returns them together with the test ids.
    
    Input:
        df_train - training dataframe
        df_test - test dataframe
    
    Output:
        user_item_train - a user-item matrix of the training dataframe 
                          (unique users for each row and unique articles for each column)
        user_item_test - a user-item matrix of the testing dataframe 
                        (unique users for each row and unique articles for each column)
        test_idx - all of the test user ids
        test_arts - all of the test article ids    
    '''
    
    user_item_train = create_user_item_matrix(df_train)
    user_item_test = create_user_item_matrix(df_test)

    test_idx = user_item_test.index
    test_arts = user_item_test.columns
    
    return user_item_train, user_item_test, test_idx, test_arts
In [97]:
user_item_train, user_item_test, test_idx, test_arts = create_test_and_train_user_item(df_train, df_test)
In [98]:
print("The training and testing datasets structure is:")
print(user_item_train.shape, user_item_test.shape)
The training and testing datasets structure is:
(4487, 714) (682, 574)
In [99]:
# 'How many users can we make predictions for in the test set?' - users must be in both sets;
# the second question below follows from this result: 682 test users - 20 predictable users = 662 cold-start users
train_idx = user_item_train.index
intersection = list(set(train_idx) & set(test_idx))
print("Intersection idx length of both sets: {}".format(len(intersection)))
Intersection idx length of both sets: 20
In [100]:
# Replace the values in the dictionary below
a = 662 
b = 574 
c = 20 
d = 0 

# my note: I changed the last 2 given dict key strings, replacing 'movies' with 'articles' (in the test file as well),
# but that leads to a KeyError even after saving the files and re-importing ???
sol_4_dict = {
    'How many users can we make predictions for in the test set?': c, 
    'How many users in the test set are we not able to make predictions for because of the cold start problem?': a, 
    'How many articles can we make predictions for in the test set?': b,
    'How many articles in the test set are we not able to make predictions for because of the cold start problem?': d
}

t.sol_4_test(sol_4_dict)
Awesome job!  That's right!  All of the test movies are in the training data, but there are only 20 test users that were also in the training set.  All of the other users that are in the test set we have no data on.  Therefore, we cannot make predictions for these users using SVD.

5. Now, we use the user_item_train dataset from above to find U, S, and V transpose using SVD. Then we find the subset of rows in the user_item_test dataset that we can predict using this matrix decomposition with different numbers of latent features to see how many features makes sense to keep based on the accuracy on the test data. This will require combining what was done in questions 2 - 4.

We explore how well SVD works towards making predictions for recommendations on the test data.

In [101]:
# fit SVD on the user_item_train matrix
# remember: again we don't have missing values, therefore the built-in svd function can be used
u_train, s_train, vt_train = np.linalg.svd(user_item_train)
In [102]:
print("What do our training set matrices u, s, vt look like now?")
print(u_train.shape, s_train.shape, vt_train.shape)
What do our training set matrices u, s, vt look like now?
(4487, 4487) (714,) (714, 714)

Now, we use the training decomposition to predict on our test data.

In [103]:
# create the test matrices for the users we can make predictions for:

# u_test
# we need the intersection of the index values of both sets - train and test matrices - as created above;
# we cannot use that idx list directly (having direct numbers, which may lead to an out-of-bounds IndexError); 
# we use: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#indexing-with-isin
# delivering an array with True and False values usable for indexing
intersection_idx = user_item_train.index.isin(test_idx)
u_test = u_train[intersection_idx, :]

# vt_test
# we have to create an intersection of the columns of both sets
intersection_clms = user_item_train.columns.isin(test_arts)
vt_test = vt_train[:, intersection_clms]
In [104]:
print("What do our test set matrices u, vt look like now?")
print(u_test.shape, vt_test.shape)
What do our test set matrices u, vt look like now?
(20, 4487) (714, 574)

We want to explore how the optimisation process of minimising the squared loss converges for all dataset combinations: training data, test data, and all data. As seen above, this depends on the number of latent features.

This recommendation concept works on the 20 identified users that appear in both the training and the test datasets. For them we calculate predictions with the training-set SVD and compare the results with the test user-item matrix to obtain the error values mentioned above.

In [105]:
# Based on the accuracy diagram above for all data, we can reduce the range of latent features:
# higher values bring no further accuracy improvement there,
# and at around 500 latent features an accuracy of 1 is already reached.

num_latent_feats = np.arange(10,400+10,10)
all_errs = []
train_errs = []
test_errs = []

for k in num_latent_feats:
    # restructure with k latent features
    s_train_new, u_train_new, vt_train_new = np.diag(s_train[:k]), u_train[:, :k], vt_train[:k, :]
    u_test_new, vt_test_new = u_test[:, :k], vt_test[:k, :]
       
    # take dot product and calculate the overall error
    user_item_train_est = np.around(np.dot(np.dot(u_train_new, s_train_new), vt_train_new))
    user_item_test_est = np.around(np.dot(np.dot(u_test_new, s_train_new), vt_test_new))
    
    # compute the error for each prediction against the actual value and keep track of it;
    # this time, for the test set, we have to use the index values of the 20 users
    # calculated before (the 'intersection' list from the assert test part above);
    # as the error we sum the absolute diffs over columns and rows
    train_diffs = np.subtract(user_item_train, user_item_train_est)
    test_diffs = np.subtract(user_item_test.loc[intersection, :], user_item_test_est)
    train_err = np.sum(np.sum(np.abs(train_diffs)))
    train_errs.append(train_err)
    test_err = np.sum(np.sum(np.abs(test_diffs))) 
    test_errs.append(test_err)
  
    # total errors and keep track of them
    all_errs.append(1-((np.sum(user_item_test_est)+np.sum(np.sum(user_item_test)))/(user_item_test.shape[0]*user_item_test.shape[1])))
In [106]:
# plot errors learning curve diagram of all, training and test data
plt.plot(num_latent_feats, all_errs, label='all data')
plt.plot(num_latent_feats, 1 - (np.array(train_errs)/(user_item_train.shape[0]*user_item_train.shape[1])),
         label='train data')
plt.plot(num_latent_feats, 1 - (np.array(test_errs)/(user_item_test.shape[0]*user_item_test.shape[1])),
         label='test data')
plt.xlabel('no. of latent features')
plt.ylabel('accuracy')
plt.title('Accuracy vs. Number of Latent Features (predictions of 20 users)')
plt.legend(loc='best');

6. We comment on the results of the previous question. Given the circumstances of our results, we discuss what we might do to determine whether the recommendations made with any of the above recommendation systems are an improvement over how users currently find articles.

Answer:
In general, with only 20 users available for prediction and very small training and test samples, the predictions and recommendations of this project implementation are not very reliable. Additionally, no real user ratings of the articles exist, and the cold-start problem is not solved appropriately. Among other things, content-based rating would be needed to capture the real popularity of an article. Overall, I would not trust the results of this recommendation system implementation.

Regarding the performance metric, i.e. the accuracy learning curve above: from roughly 225 latent features onwards, the training curve lies above the test curve, which indicates overfitting. An improvement can therefore be expected by configuring a lower number of latent features; to find the best value, a cross-validation concept should be added.

Even though SVD-based matrix factorisation with a lower range of latent features shows a visible accuracy improvement on the training and test sets compared to the accuracy distribution over all data, more advanced and sophisticated strategies should be used to implement this kind of recommender system.
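One reason accuracy alone is a weak indicator on this data can be shown with a synthetic sketch (hypothetical shape and sparsity, chosen to mimic the imbalance of the real matrix): a model that predicts 'no interaction' everywhere already scores near-perfect accuracy.

```python
import numpy as np

rng = np.random.default_rng(0)
# synthetic interaction matrix with roughly 1% ones, mimicking the sparse real one
mat = (rng.random((500, 100)) < 0.01).astype(float)

# trivial predictor: no interaction anywhere; its accuracy is the share of zeros
baseline_accuracy = 1.0 - mat.mean()
```

Any model should therefore be compared against this trivial baseline, or evaluated with precision and recall on the interaction class instead of plain accuracy.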

As a further improvement, Gaurav Sharma, for example, introduced and explained in his Medium blog post How to Build a Recommender System (RS) another key performance indicator as well as additional hyperparameters to avoid overfitting.

As he wrote in his blog post:
"We just thought of user-bias and item-bias and incorporated both of them into our formulation. This is the advantage of optimization on matrix factorization. It would not be possible on user-user and item-item similarity.

In the actual research paper, which winners of Netflix prize wrote, they extend this further and commented that “ratings by users and for items are time dependent”. Therefore, they made rating, user-bias and item-bias r_ui(t), bu(t) & bi(t) respectively as a function of time."

In [7]:
# see: https://docs.python.org/3/library/subprocess.html
# for creating a .html file of this notebook, 0 is returned after successful conversion
# from subprocess import call
call(['python', '-m', 'nbconvert', 'Recommendations_with_IBM.ipynb'])
Out[7]:
0